Professional-Data-Engineer Practice Dumps - Verified By TopExamCollection Updated 270 Questions
Updated Professional-Data-Engineer Exam Dumps - PDF Questions and Testing Engine
The candidates must develop practical skills in the exam topics to succeed. These objectives are highlighted below:
Design Data Processing Systems
- Design Data Processing Solutions: This topic covers expertise in planning, use of distributed systems, choice of infrastructure, hybrid cloud & edge computing, and system availability & fault tolerance. You should also know the architecture options, including message queues, message brokers, service-oriented architecture, middleware, and serverless functions;
- Migrate Data Processing & Data Warehousing: This section includes validating migrations, migration from on-premises to Cloud, and awareness of the current state & how to migrate designs to the future state.
- Select the Relevant Storage Technologies: The considerations for this area include mapping storage systems to the business needs, data modeling, distributed systems, as well as tradeoffs, involving transactions, throughput, and latency;
- Design Data Pipeline: The focus for this subsection includes data visualization & publishing and batch & streaming data (Cloud Dataproc, Cloud Dataflow, Cloud Pub/Sub, Hadoop ecosystem, Apache Spark, Apache Beam, and Apache Kafka). It also focuses on online versus batch prediction and job orchestration & automation;
NEW QUESTION # 117
Flowlogistic wants to use Google BigQuery as their primary analysis system, but they still have Apache Hadoop and Spark workloads that they cannot move to BigQuery. Flowlogistic does not know how to store the data that is common to both workloads. What should they do?
- A. Store the common data in BigQuery and expose authorized views.
- B. Store the common data in BigQuery as partitioned tables.
- C. Store the common data encoded as Avro in Google Cloud Storage.
- D. Store the common data in the HDFS storage for a Google Cloud Dataproc cluster.
Answer: C
NEW QUESTION # 118
An organization maintains a Google BigQuery dataset that contains tables with user-level data. They want to expose aggregates of this data to other Google Cloud projects, while still controlling access to the user-level data. Additionally, they need to minimize their overall storage cost and ensure the analysis cost for other projects is assigned to those projects. What should they do?
- A. Create dataViewer Identity and Access Management (IAM) roles on the dataset to enable sharing.
- B. Create and share a new dataset and view that provides the aggregate results.
- C. Create and share a new dataset and table that contains the aggregate results.
- D. Create and share an authorized view that provides the aggregate results.
Answer: D
NEW QUESTION # 119
You have a requirement to insert minute-resolution data from 50,000 sensors into a BigQuery table. You expect significant growth in data volume and need the data to be available within 1 minute of ingestion for real-time analysis of aggregated trends. What should you do?
- A. Use the MERGE statement to apply updates in batch every 60 seconds.
- B. Use a Cloud Dataflow pipeline to stream data into the BigQuery table.
- C. Use bq load to load a batch of sensor data every 60 seconds.
- D. Use the INSERT statement to insert a batch of data every 60 seconds.
Answer: B
NEW QUESTION # 120
Government regulations in your industry mandate that you have to maintain an auditable record of access to certain types of data. Assuming that all expiring logs will be archived correctly, where should you store data that is subject to that mandate?
- A. Encrypted on Cloud Storage with user-supplied encryption keys. A separate decryption key will be given to each authorized user.
- B. In a bucket on Cloud Storage that is accessible only by an AppEngine service that collects user information and logs the access before providing a link to the bucket.
- C. In Cloud SQL, with separate database user names to each user. The Cloud SQL Admin activity logs will be used to provide the auditability.
- D. In a BigQuery dataset that is viewable only by authorized personnel, with the Data Access log used to provide the auditability.
Answer: D
NEW QUESTION # 121
MJTelco Case Study
Company Overview
MJTelco is a startup that plans to build networks in rapidly growing, underserved markets around the world. The company has patents for innovative optical communications hardware. Based on these patents, they can create many reliable, high-speed backbone links with inexpensive hardware.
Company Background
Founded by experienced telecom executives, MJTelco uses technologies originally developed to overcome communications challenges in space. Fundamental to their operation, they need to create a distributed data infrastructure that drives real-time analysis and incorporates machine learning to continuously optimize their topologies. Because their hardware is inexpensive, they plan to overdeploy the network allowing them to account for the impact of dynamic regional politics on location availability and cost.
Their management and operations teams are situated all around the globe, creating a many-to-many relationship between data consumers and providers in their system. After careful consideration, they decided public cloud is the perfect environment to support their needs.
Solution Concept
MJTelco is running a successful proof-of-concept (PoC) project in its labs. They have two primary needs:
* Scale and harden their PoC to support significantly more data flows generated when they ramp to more than 50,000 installations.
* Refine their machine-learning cycles to verify and improve the dynamic models they use to control topology definition.
MJTelco will also use three separate operating environments - development/test, staging, and production - to meet the needs of running experiments, deploying new features, and serving production customers.
Business Requirements
* Scale up their production environment with minimal cost, instantiating resources when and where needed in an unpredictable, distributed telecom user community.
* Ensure security of their proprietary data to protect their leading-edge machine learning and analysis.
* Provide reliable and timely access to data for analysis from distributed research workers.
* Maintain isolated environments that support rapid iteration of their machine-learning models without affecting their customers.
Technical Requirements
* Ensure secure and efficient transport and storage of telemetry data.
* Rapidly scale instances to support between 10,000 and 100,000 data providers with multiple flows each.
* Allow analysis and presentation against data tables tracking up to 2 years of data storing approximately 100m records/day.
* Support rapid iteration of monitoring infrastructure focused on awareness of data pipeline problems both in telemetry flows and in production learning cycles.
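To get a feel for the scale implied by the "2 years of data" and "100m records/day" requirement, the arithmetic can be sketched as below. This is a rough back-of-the-envelope estimate only; it assumes a constant daily volume and ignores leap days, neither of which the case study states.

```python
# Rough sizing for the MJTelco telemetry tables (illustrative assumptions only).
RECORDS_PER_DAY = 100_000_000      # "approximately 100m records/day"
RETENTION_DAYS = 2 * 365           # "up to 2 years", ignoring leap days
SECONDS_PER_DAY = 24 * 60 * 60

total_records = RECORDS_PER_DAY * RETENTION_DAYS
avg_records_per_second = RECORDS_PER_DAY / SECONDS_PER_DAY

print(f"{total_records:,} records retained")                       # 73,000,000,000 records retained
print(f"{avg_records_per_second:,.0f} records/second on average")  # 1,157 records/second on average
```

Roughly 73 billion rows over the retention period is the order of magnitude to keep in mind when reading the MJTelco questions.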
CEO Statement
Our business model relies on our patents, analytics and dynamic machine learning. Our inexpensive hardware is organized to be highly reliable, which gives us cost advantages. We need to quickly stabilize our large distributed data pipelines to meet our reliability and capacity commitments.
CTO Statement
Our public cloud services must operate as advertised. We need resources that scale and keep our data secure. We also need environments in which our data scientists can carefully study and quickly adapt our models. Because we rely on automation to process our data, we also need our development and test environments to work as we iterate.
CFO Statement
The project is too large for us to maintain the hardware and software required for the data and analysis. Also, we cannot afford to staff an operations team to monitor so many data feeds, so we will rely on automation and infrastructure. Google Cloud's machine learning will allow our quantitative researchers to work on our high-value problems instead of problems with our data pipelines.
You create a new report for your large team in Google Data Studio 360. The report uses Google BigQuery as its data source. It is company policy to ensure employees can view only the data associated with their region, so you create and populate a table for each region. You need to enforce the regional access policy to the data.
Which two actions should you take? (Choose two.)
- A. Ensure each table is included in a dataset for a region.
- B. Adjust the settings for each view to allow a related region-based security group view access.
- C. Adjust the settings for each dataset to allow a related region-based security group view access.
- D. Adjust the settings for each table to allow a related region-based security group view access.
- E. Ensure all the tables are included in a global dataset.
Answer: A,C
NEW QUESTION # 122
You are working on a sensitive project involving private user data. You have set up a project on Google Cloud Platform to house your work internally. An external consultant is going to assist with coding a complex transformation in a Google Cloud Dataflow pipeline for your project. How should you maintain users' privacy?
- A. Grant the consultant the Cloud Dataflow Developer role on the project.
- B. Create a service account and allow the consultant to log on with it.
- C. Create an anonymized sample of the data for the consultant to work with in a different project.
- D. Grant the consultant the Viewer role on the project.
Answer: C
NEW QUESTION # 123
You plan to deploy Cloud SQL using MySQL. You need to ensure high availability in the event of a zone failure. What should you do?
- A. Create a Cloud SQL instance in a region, and configure automatic backup to a Cloud Storage bucket in the same region.
- B. Create a Cloud SQL instance in one zone, and configure an external read replica in a zone in a different region.
- C. Create a Cloud SQL instance in one zone, and create a read replica in another zone within the same region.
- D. Create a Cloud SQL instance in one zone, and create a failover replica in another zone within the same region.
Answer: D
Explanation:
https://cloud.google.com/sql/docs/mysql/high-availability
NEW QUESTION # 124
A TensorFlow machine learning model on Compute Engine virtual machines (n2-standard-32) takes two days to complete training. The model has custom TensorFlow operations that must run partially on a CPU. You want to reduce the training time in a cost-effective manner. What should you do?
- A. Train the model using a VM with a GPU hardware accelerator
- B. Change the VM type to n2-highmem-32
- C. Train the model using a VM with a TPU hardware accelerator
- D. Change the VM type to e2-standard-32
Answer: A
NEW QUESTION # 125
Your analytics team wants to build a simple statistical model to determine which customers are most likely to work with your company again, based on a few different metrics. They want to run the model on Apache Spark, using data housed in Google Cloud Storage, and you have recommended using Google Cloud Dataproc to execute this job. Testing has shown that this workload can run in approximately 30 minutes on a 15-node cluster, outputting the results into Google BigQuery. The plan is to run this workload weekly. How should you optimize the cluster for cost?
- A. Use pre-emptible virtual machines (VMs) for the cluster
- B. Migrate the workload to Google Cloud Dataflow
- C. Use a higher-memory node so that the job runs faster
- D. Use SSDs on the worker nodes so that the job can run faster
Answer: A
NEW QUESTION # 126
You have a requirement to insert minute-resolution data from 50,000 sensors into a BigQuery table. You expect significant growth in data volume and need the data to be available within 1 minute of ingestion for real-time analysis of aggregated trends. What should you do?
- A. Use a Cloud Dataflow pipeline to stream data into the BigQuery table.
- B. Use the MERGE statement to apply updates in batch every 60 seconds.
- C. Use bq load to load a batch of sensor data every 60 seconds.
- D. Use the INSERT statement to insert a batch of data every 60 seconds.
Answer: A
NEW QUESTION # 127
Which role must be assigned to a service account used by the virtual machines in a Dataproc cluster so they can execute jobs?
- A. Dataproc Editor
- B. Dataproc Runner
- C. Dataproc Worker
- D. Dataproc Viewer
Answer: C
Explanation:
A service account used by the VMs in a Cloud Dataproc cluster must have the Dataproc Worker role (roles/dataproc.worker), or all the permissions granted by that role.
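As an illustration of what granting that role looks like, the sketch below adds a binding to an IAM policy represented as a plain dictionary in the usual `bindings` / `role` / `members` shape. The helper `grant_role` and the service account address are hypothetical, and the sketch ignores conditional bindings; it does not call any Google Cloud API.

```python
def grant_role(policy, role, member):
    """Add `member` to the binding for `role`, creating the binding if needed.

    `policy` is a dict shaped like an IAM policy: {"bindings": [{"role": ..., "members": [...]}]}.
    Hypothetical helper for illustration; conditional bindings are not handled.
    """
    bindings = policy.setdefault("bindings", [])
    for binding in bindings:
        if binding.get("role") == role:
            members = binding.setdefault("members", [])
            if member not in members:
                members.append(member)
            return policy
    bindings.append({"role": role, "members": [member]})
    return policy


# Hypothetical service account used by the cluster VMs.
vm_account = "serviceAccount:cluster-vm@example-project.iam.gserviceaccount.com"

policy = {"bindings": [{"role": "roles/viewer", "members": ["user:analyst@example.com"]}]}
grant_role(policy, "roles/dataproc.worker", vm_account)
grant_role(policy, "roles/dataproc.worker", vm_account)  # idempotent: no duplicate member
```

After these calls the policy contains one `roles/dataproc.worker` binding with the VM service account as its only member.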
NEW QUESTION # 128
You are building a model to make clothing recommendations. You know a user's fashion preference is likely to change over time, so you build a data pipeline to stream new data back to the model as it becomes available. How should you use this data to train the model?
- A. Continuously retrain the model on just the new data.
- B. Train on the existing data while using the new data as your test set.
- C. Continuously retrain the model on a combination of existing data and the new data.
- D. Train on the new data while using the existing data as your test set.
Answer: C
NEW QUESTION # 129
Which of the following statements about Legacy SQL and Standard SQL is not true?
- A. Standard SQL is the preferred query language for BigQuery.
- B. You need to set a query language for each dataset and the default is Standard SQL.
- C. If you write a query in Legacy SQL, it might generate an error if you try to run it with Standard SQL.
- D. One difference between the two query languages is how you specify fully-qualified table names (i.e. table names that include their associated project name).
Answer: B
Explanation:
You do not set a query language for each dataset. It is set each time you run a query and the default query language is Legacy SQL.
Standard SQL has been the preferred query language since BigQuery 2.0 was released.
In legacy SQL, to query a table with a project-qualified name, you use a colon, :, as a separator. In standard SQL, you use a period, ., instead.
Due to the differences in syntax between the two query languages (such as with project-qualified table names), if you write a query in Legacy SQL, it might generate an error if you try to run it with Standard SQL.
Reference:
https://cloud.google.com/bigquery/docs/reference/standard-sql/migrating-from-legacy-sql
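The separator difference described in the explanation can be made concrete with a small sketch that rewrites legacy-style `[project:dataset.table]` references into the backtick-quoted, period-separated form used by standard SQL. The function name `legacy_to_standard_table` is hypothetical, and the regular expression is a simplification (for example, it does not handle domain-scoped project IDs that themselves contain a colon).

```python
import re

# Matches a legacy SQL table reference such as [my-project:my_dataset.my_table].
_LEGACY_TABLE = re.compile(
    r"\[(?P<project>[^:\]]+):(?P<dataset>[^.\]]+)\.(?P<table>[^\]]+)\]"
)


def legacy_to_standard_table(sql):
    """Rewrite [project:dataset.table] as `project.dataset.table` (simplified sketch)."""
    return _LEGACY_TABLE.sub(
        lambda m: "`{project}.{dataset}.{table}`".format(**m.groupdict()),
        sql,
    )


legacy = "SELECT word FROM [bigquery-public-data:samples.shakespeare] LIMIT 10"
standard = legacy_to_standard_table(legacy)
print(standard)
# SELECT word FROM `bigquery-public-data.samples.shakespeare` LIMIT 10
```

Only the table reference is converted here; other legacy SQL constructs would still need to be migrated by hand.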
NEW QUESTION # 130
You work for a large bank that operates in locations throughout North America. You are setting up a data storage system that will handle bank account transactions. You require ACID compliance and the ability to access data with SQL. Which solution is appropriate?
- A. Store transaction data in BigQuery. Disable the query cache to ensure consistency.
- B. Store transaction data in Cloud SQL. Use BigQuery federated queries for analysis.
- C. Store transaction data in Cloud Spanner. Use locking read-write transactions.
- D. Store transaction data in Cloud Spanner. Enable stale reads to reduce latency.
Answer: C
NEW QUESTION # 131
You are designing a basket abandonment system for an ecommerce company. The system will send a message to a user based on these rules:
- No interaction by the user on the site for 1 hour
- Has added more than $30 worth of products to the basket
- Has not completed a transaction
You use Google Cloud Dataflow to process the data and decide if a message should be sent. How should you design the pipeline?
- A. Use a sliding time window with a duration of 60 minutes.
- B. Use a fixed-time window with a duration of 60 minutes.
- C. Use a session window with a gap time duration of 60 minutes.
- D. Use a global window with a time based trigger with a delay of 60 minutes.
Answer: C
Explanation:
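The pipeline sends a message to a user after that user has been inactive for 60 minutes. A session window works well for capturing activity on a per-user basis, because the window closes only when the gap duration passes with no new events for that user.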
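The session-window behaviour can be sketched in plain Python, without the Dataflow/Beam SDK: events are grouped per user, and a new session starts whenever the time since that user's previous event reaches the gap duration. The function `sessionize` and the sample events are hypothetical and only illustrate the grouping rule, not a real pipeline.

```python
from collections import defaultdict
from datetime import datetime, timedelta

GAP = timedelta(minutes=60)  # inactivity gap that closes a session


def sessionize(events, gap=GAP):
    """Group (user_id, timestamp) events into per-user sessions.

    A new session starts when the time since the user's previous event
    is at least `gap`. Returns {user_id: [[timestamps of session 1], ...]}.
    """
    by_user = defaultdict(list)
    for user_id, ts in events:
        by_user[user_id].append(ts)

    sessions = {}
    for user_id, stamps in by_user.items():
        stamps.sort()
        user_sessions = [[stamps[0]]]
        for ts in stamps[1:]:
            if ts - user_sessions[-1][-1] >= gap:
                user_sessions.append([ts])      # gap elapsed: open a new session
            else:
                user_sessions[-1].append(ts)    # still within the current session
        sessions[user_id] = user_sessions
    return sessions


t0 = datetime(2024, 1, 1, 9, 0)
events = [
    ("alice", t0),
    ("alice", t0 + timedelta(minutes=20)),
    ("alice", t0 + timedelta(minutes=95)),  # 75 minutes after her previous event
    ("bob", t0 + timedelta(minutes=5)),
]
result = sessionize(events)
```

Here alice ends up with two sessions (two events, then one) and bob with a single session. In the basket abandonment scenario, the basket-value and no-transaction checks would be applied to each closed session.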
NEW QUESTION # 132
Your company is migrating their 30-node Apache Hadoop cluster to the cloud. They want to re-use Hadoop jobs they have already created and minimize the management of the cluster as much as possible. They also want to be able to persist data beyond the life of the cluster. What should you do?
- A. Create a Hadoop cluster on Google Compute Engine that uses Local SSD disks.
- B. Create a Cloud Dataproc cluster that uses the Google Cloud Storage connector.
- C. Create a Hadoop cluster on Google Compute Engine that uses persistent disks.
- D. Create a Google Cloud Dataproc cluster that uses persistent disks for HDFS.
- E. Create a Google Cloud Dataflow job to process the data.
Answer: B
NEW QUESTION # 133
The marketing team at your organization provides regular updates of a segment of your customer dataset. The marketing team has given you a CSV with 1 million records that must be updated in BigQuery. When you use the UPDATE statement in BigQuery, you receive a quotaExceeded error. What should you do?
- A. Import the new records from the CSV file into a new BigQuery table. Create a BigQuery job that merges the new records with the existing records and writes the results to a new BigQuery table.
- B. Reduce the number of records updated each day to stay within the BigQuery UPDATE DML statement limit.
- C. Increase the BigQuery UPDATE DML statement limit in the Quota management section of the Google Cloud Platform Console.
- D. Split the source CSV file into smaller CSV files in Cloud Storage to reduce the number of BigQuery UPDATE DML statements per BigQuery job.
Answer: A
Explanation:
https://cloud.google.com/blog/products/gcp/performing-large-scale-mutations-in-bigquery
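The idea behind option A is to replace many row-level UPDATE statements with one set-based merge whose output is written as a new table. A minimal in-memory sketch of that merge logic follows; the function `merge_records`, the `id` key, and the sample rows are hypothetical, and in BigQuery the same step would be a single query job joining the staging table to the existing table.

```python
def merge_records(existing, updates, key="id"):
    """Return a new list of rows: existing rows overlaid with updates, matched on `key`.

    Rows present only in `updates` are appended; the inputs are not modified.
    """
    merged = {row[key]: dict(row) for row in existing}
    for row in updates:
        merged[row[key]] = {**merged.get(row[key], {}), **row}
    return list(merged.values())


existing = [
    {"id": 1, "segment": "bronze"},
    {"id": 2, "segment": "silver"},
]
updates = [
    {"id": 2, "segment": "gold"},    # changes an existing customer
    {"id": 3, "segment": "bronze"},  # new customer
]
new_table = merge_records(existing, updates)
```

The result holds customer 1 unchanged, customer 2 with the updated segment, and customer 3 added, all produced in one pass rather than one statement per record.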
NEW QUESTION # 134
You are creating a new pipeline in Google Cloud to stream IoT data from Cloud Pub/Sub through Cloud Dataflow to BigQuery. While previewing the data, you notice that roughly 2% of the data appears to be corrupt. You need to modify the Cloud Dataflow pipeline to filter out this corrupt data. What should you do?
- A. Add a SideInput that returns a Boolean if the element is corrupt.
- B. Add a Partition transform in Cloud Dataflow to separate valid data from corrupt data.
- C. Add a GroupByKey transform in Cloud Dataflow to group all of the valid data together and discard the rest.
- D. Add a ParDo transform in Cloud Dataflow to discard corrupt elements.
Answer: D
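A ParDo fits this case because it may emit zero or one output per input element, so corrupt elements can simply produce nothing. The sketch below mimics that behaviour in plain Python rather than with the Beam SDK: `drop_corrupt` plays the part of the per-element function, and `par_do` is a hypothetical stand-in for the transform. The validity rule (JSON object with a `sensor_id` and a numeric `reading`) is an assumption made for illustration.

```python
import json


def drop_corrupt(raw_message):
    """Yield the parsed record if it is valid; yield nothing if it is corrupt."""
    try:
        record = json.loads(raw_message)
    except (TypeError, ValueError):
        return  # not parseable as JSON: emit nothing
    if not isinstance(record, dict):
        return
    if "sensor_id" not in record or not isinstance(record.get("reading"), (int, float)):
        return  # missing field or non-numeric reading: emit nothing
    yield record


def par_do(fn, elements):
    """Apply `fn` to each element and flatten its zero-or-more outputs."""
    for element in elements:
        yield from fn(element)


messages = [
    '{"sensor_id": "a1", "reading": 21.5}',
    '{"sensor_id": "a2", "reading": "n/a"}',  # corrupt: reading is not numeric
    'not json',                               # corrupt: unparseable
    '{"reading": 3}',                         # corrupt: no sensor_id
]
valid = list(par_do(drop_corrupt, messages))
```

Only the first message survives. In a real pipeline the same per-element logic would live in the processing function passed to the ParDo.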
NEW QUESTION # 135
Case Study 1 - Flowlogistic
Company Overview
Flowlogistic is a leading logistics and supply chain provider. They help businesses throughout the world manage their resources and transport them to their final destination. The company has grown rapidly, expanding their offerings to include rail, truck, aircraft, and oceanic shipping.
Company Background
The company started as a regional trucking company, and then expanded into other logistics markets. Because they have not updated their infrastructure, managing and tracking orders and shipments has become a bottleneck. To improve operations, Flowlogistic developed proprietary technology for tracking shipments in real time at the parcel level. However, they are unable to deploy it because their technology stack, based on Apache Kafka, cannot support the processing volume. In addition, Flowlogistic wants to further analyze their orders and shipments to determine how best to deploy their resources.
Solution Concept
Flowlogistic wants to implement two concepts using the cloud:
* Use their proprietary technology in a real-time inventory-tracking system that indicates the location of their loads
* Perform analytics on all their orders and shipment logs, which contain both structured and unstructured data, to determine how best to deploy resources and which markets to expand into. They also want to use predictive analytics to learn earlier when a shipment will be delayed.
Existing Technical Environment
Flowlogistic architecture resides in a single data center:
* Databases
8 physical servers in 2 clusters
- SQL Server - user data, inventory, static data
3 physical servers
- Cassandra - metadata, tracking messages
10 Kafka servers - tracking message aggregation and batch insert
* Application servers - customer front end, middleware for order/customs
60 virtual machines across 20 physical servers
- Tomcat - Java services
- Nginx - static content
- Batch servers
* Storage appliances
- iSCSI for virtual machine (VM) hosts
- Fibre Channel storage area network (FC SAN) - SQL server storage
- Network-attached storage (NAS) image storage, logs, backups
* 10 Apache Hadoop/Spark servers
- Core Data Lake
- Data analysis workloads
* 20 miscellaneous servers
- Jenkins, monitoring, bastion hosts
Business Requirements
* Build a reliable and reproducible environment with scaled parity of production.
* Aggregate data in a centralized Data Lake for analysis
* Use historical data to perform predictive analytics on future shipments
* Accurately track every shipment worldwide using proprietary technology
* Improve business agility and speed of innovation through rapid provisioning of new resources
* Analyze and optimize architecture for performance in the cloud
* Migrate fully to the cloud if all other requirements are met
Technical Requirements
* Handle both streaming and batch data
* Migrate existing Hadoop workloads
* Ensure architecture is scalable and elastic to meet the changing demands of the company.
* Use managed services whenever possible
* Encrypt data in flight and at rest
* Connect a VPN between the production data center and cloud environment
CEO Statement
We have grown so quickly that our inability to upgrade our infrastructure is really hampering further growth and efficiency. We are efficient at moving shipments around the world, but we are inefficient at moving data around. We need to organize our information so we can more easily understand where our customers are and what they are shipping.
CTO Statement
IT has never been a priority for us, so as our data has grown, we have not invested enough in our technology. I have a good staff to manage IT, but they are so busy managing our infrastructure that I cannot get them to do the things that really matter, such as organizing our data, building the analytics, and figuring out how to implement the CFO's tracking technology.
CFO Statement
Part of our competitive advantage is that we penalize ourselves for late shipments and deliveries. Knowing where our shipments are at all times has a direct correlation to our bottom line and profitability. Additionally, I don't want to commit capital to building out a server environment.
Flowlogistic's CEO wants to gain rapid insight into their customer base so his sales team can be better informed in the field. This team is not very technical, so they've purchased a visualization tool to simplify the creation of BigQuery reports. However, they've been overwhelmed by all the data in the table, and are spending a lot of money on queries trying to find the data they need. You want to solve their problem in the most cost-effective way. What should you do?
- A. Create identity and access management (IAM) roles on the appropriate columns, so only they appear in a query.
- B. Export the data into a Google Sheet for visualization.
- C. Create an additional table with only the necessary columns.
- D. Create a view on the table to present to the visualization tool.
Answer: D
NEW QUESTION # 136
You use a dataset in BigQuery for analysis. You want to provide third-party companies with access to the same dataset. You need to keep the costs of data sharing low and ensure that the data is current. Which solution should you choose?
- A. Use Cloud Scheduler to export the data on a regular basis to Cloud Storage, and provide third-party companies with access to the bucket.
- B. Create a separate dataset in BigQuery that contains the relevant data to share, and provide third-party companies with access to the new dataset.
- C. Create an authorized view on the BigQuery table to control data access, and provide third-party companies with access to that view.
- D. Create a Cloud Dataflow job that reads the data in frequent time intervals, and writes it to the relevant BigQuery dataset or Cloud Storage bucket for third-party companies to use.
Answer: C
Explanation:
Creating an authorized view ensures that the data is current and avoids using additional storage space (and cost) to share the dataset. B and D are not cost optimal, and A does not guarantee that the data is kept up to date.
The Google Professional-Data-Engineer certification exam tests a candidate's proficiency in using Google Cloud Platform tools and services for data processing, such as Google Cloud Dataflow, Google BigQuery, Google Cloud Dataproc, and Google Cloud Pub/Sub. The exam also assesses a candidate's ability to design and implement data processing systems that are secure, reliable, and cost-effective.

