Google Cloud Certified Professional Cloud Architect Study Guide. Dan Sullivan
data collection system. Some of the data is collected in batches when vehicles return to base, and some is collected as it is generated.
As part of planning for future needs, an architect working with TerramEarth should consider how to support vehicles sending more data directly, eventually retiring the batch data load process in favor of real-time or near-real-time data uploads for all vehicles. This would require planning an ingest pipeline that can receive data reliably and perform any necessary preprocessing.
Services that store and analyze the data will need to scale to support millions of vehicles transmitting data. To accomplish this, consider using a Cloud Pub/Sub queue, which decouples data producers from consumers and buffers messages so that data is not lost if the ingestion services cannot keep up with the rate at which new data is received.
TerramEarth will likely want to share the growing inventory of data with dealers. This will require integrating TerramEarth and dealer systems. The architect should gather additional details to understand how best to share data with dealers. For example, an architect may ask, “What features of the data are significant to dealers?” Also, architects should consider how dealers would access the data. Dealer applications could query TerramEarth APIs for data, or TerramEarth could send data directly to dealer data warehouses or other reporting platforms.
Each case study has examples of systems integration requirements. When considering these requirements, keep in mind the structure and volume of data exchanged, the frequency of data exchange, the need for authentication, the reliability of each system, how to prevent data loss in the event of a problem with one of the services, and how to protect services from intentional or unintentional bursts in API requests that could overwhelm the receiving services.
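The text does not name a technique for protecting a receiving service from bursts of API requests. One common approach is a token bucket rate limiter, sketched below under that assumption. The class name, parameters, and the explicit now_s clock argument are all hypothetical and chosen only to keep the example deterministic; in GCP, a managed API gateway would typically enforce limits like this rather than hand-written code.

```python
class TokenBucket:
    """Illustrative token bucket: permits short bursts up to `capacity`
    requests, then throttles to `rate_per_s` requests per second."""

    def __init__(self, rate_per_s: float, capacity: float, now_s: float = 0.0):
        self.rate_per_s = rate_per_s
        self.capacity = capacity
        self.tokens = capacity      # start with a full bucket
        self.last_s = now_s

    def allow(self, now_s: float) -> bool:
        """Return True if a request arriving at time `now_s` may proceed."""
        # Refill tokens for the time elapsed since the last request.
        elapsed = max(0.0, now_s - self.last_s)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate_per_s)
        self.last_s = max(self.last_s, now_s)
        if self.tokens >= 1.0:
            self.tokens -= 1.0      # spend one token on this request
            return True
        return False                # bucket empty: reject or queue the request


if __name__ == "__main__":
    bucket = TokenBucket(rate_per_s=1.0, capacity=2.0)
    # Three requests at the same instant, then one a second later.
    print([bucket.allow(0.0), bucket.allow(0.0), bucket.allow(0.0), bucket.allow(1.0)])
```

A rejected request does not have to be dropped. The caller could retry later, or the request could be buffered in a queue, which ties back to the goal of preventing data loss.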
Data Management Business Requirements
In addition to using business requirements to understand which systems need to work together, architects can use them to understand how data must be managed. At a minimum, data management considerations include the following:
How much data will be collected and stored?
How long will it be stored?
What processing will be applied to the data?
Who will have access to the data?
How Much Data Is Stored?
One of the first questions asked about data management is “How much data are we expecting, and at what rate will we receive it?” Knowing the expected volumes of data will help plan for storage.
If data is being stored in a managed service like Cloud Storage, then Google will manage the provisioning of storage space, but you should still understand the volume of data so that you can accurately estimate storage costs and work within storage limits.
It is important to plan for adequate storage capacity. Those responsible for managing storage will also need to know the rate at which new data arrives and the rate at which existing data is removed from the system. The difference between the two rates gives the growth rate of stored data, which determines how quickly storage capacity must grow.
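The growth-rate reasoning above can be sketched as simple arithmetic. The function names and the sample figures below are hypothetical, and the projection assumes constant arrival and removal rates, which real workloads rarely have.

```python
def net_growth_gb_per_day(ingest_gb_per_day: float, removed_gb_per_day: float) -> float:
    """Net change in stored data per day: arrivals minus removals."""
    return ingest_gb_per_day - removed_gb_per_day


def projected_storage_gb(current_gb: float, ingest_gb_per_day: float,
                         removed_gb_per_day: float, days: int) -> float:
    """Linear projection of stored volume after `days`, never below zero."""
    growth = net_growth_gb_per_day(ingest_gb_per_day, removed_gb_per_day)
    return max(0.0, current_gb + growth * days)


if __name__ == "__main__":
    # Hypothetical figures: 500 GB stored today, 20 GB/day arriving,
    # 5 GB/day expiring, projected 90 days ahead.
    print(projected_storage_gb(500.0, 20.0, 5.0, 90))  # 500 + 15 * 90 = 1850.0
```

Multiplying the projected volume by the per-GB price of the chosen storage class gives a first-order cost estimate.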
How Long Is Data Stored?
It is also important to understand how long data will be stored in various storage systems. For example, data may first arrive in a Cloud Pub/Sub queue and be immediately processed by a Cloud Function that removes the data from the queue, transforms it, and writes it to another storage system. In this case, the data is usually in the queue for only a short period of time. If the ingestion process is down, however, data may accumulate in the Cloud Pub/Sub topic. Since Cloud Pub/Sub is a managed service, the DevOps team would not have to allocate additional storage to the queue if there is a backlog of data; GCP will take care of that. The team will, however, have to consider how long the data should be retained in the queue. For example, there may be little or no business value in data that is more than seven days old. In that case, the team should configure the Cloud Pub/Sub queue with a seven-day retention period.
If data is stored in Cloud Storage, you can take advantage of the service's lifecycle policies to delete data or change its storage class as needed. For example, if the data is rarely accessed after 30 days, it can be moved to Nearline storage, while Coldline storage is a good option for data accessed no more than once every 90 days. If data is accessed once a year or less often, then Archive storage is an appropriate choice.
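As a rough illustration, the storage class transitions described above could be expressed as a Cloud Storage lifecycle configuration. The sketch below builds the JSON document in Python. It assumes object age in days is an acceptable proxy for access frequency, since lifecycle conditions are evaluated on object properties such as age rather than on how often an object is read. The function name and the optional delete_after_days parameter are hypothetical.

```python
import json
from typing import Optional


def build_lifecycle_config(delete_after_days: Optional[int] = None) -> dict:
    """Build a lifecycle configuration that moves objects to colder
    storage classes as they age, and optionally deletes them."""
    rules = [
        # Age thresholds mirror the 30-day, 90-day, and one-year examples.
        {"action": {"type": "SetStorageClass", "storageClass": "NEARLINE"},
         "condition": {"age": 30}},
        {"action": {"type": "SetStorageClass", "storageClass": "COLDLINE"},
         "condition": {"age": 90}},
        {"action": {"type": "SetStorageClass", "storageClass": "ARCHIVE"},
         "condition": {"age": 365}},
    ]
    if delete_after_days is not None:
        # Assumption: data has no business value past this age.
        rules.append({"action": {"type": "Delete"},
                      "condition": {"age": delete_after_days}})
    return {"rule": rules}


if __name__ == "__main__":
    print(json.dumps(build_lifecycle_config(delete_after_days=730), indent=2))
```

The printed JSON could be saved to a file and applied to a bucket with a command such as gsutil lifecycle set. Verify the thresholds against actual access patterns first, because colder storage classes carry retrieval charges and minimum storage durations.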
When data is stored in a database, you will have to develop procedures for removing data when it is no longer needed. In this case, the data could be backed up to Cloud Storage for archiving, or it could be deleted without keeping a copy. You should consider the trade-offs between the possible benefits of having data available and the cost of storing it. For instance, machine learning models can take advantage of large volumes of data, so it may be advisable to archive data even if you cannot anticipate a use for the data at this time.
What Processing Is Applied to the Data?
There are several things to consider about how data is processed. These include the following:
Distance between the location of stored data and services that will process the data
Volume of data that is moved from storage to processing services
Acceptable latency when reading and writing the data
Stream or batch processing
In the case of stream processing, how long to wait for late arriving data
The distance between the location of data and where it is processed is an important consideration. This affects both the time it takes to read and write the data and, in some cases, the network costs for transmitting the data. If data will be accessed from multiple geographic regions, consider using multiregional storage when using Cloud Storage. If you are storing data in a relational database, consider replicating data to a read-only replica located in a region closer to where the data will be read. Because all writes still go to a single write instance of the database, this approach will not improve the time to ingest or update the data.
Understand whether the data will be processed in batches or as a stream. Batch processes tend to tolerate longer latencies, so moving data between regions may not create problems for meeting business requirements around how fast the data needs to be processed. If data is currently processed in batches but will be processed as a stream in the future, consider using Cloud Dataflow, Google's managed Apache Beam service. Apache Beam provides a unified processing model for batch and stream processing.
When working with stream processing, you should consider how you will deal with late arriving and missing data. It is a common practice in stream processing to assume that no data older than a specified time will arrive. For example, if a process collects telemetry data from sensors every minute, a process may wait up to five minutes for data. If the data has not arrived by then, the process assumes that it will never arrive.
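The five-minute rule in the example above can be sketched as a simple lateness check. This is a deliberately simplified illustration: stream processing frameworks such as Apache Beam track lateness with watermarks and windows rather than a per-record comparison, and the names and data layout here are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple

ALLOWED_LATENESS_S = 5 * 60  # wait up to five minutes, as in the example


@dataclass
class Reading:
    sensor_id: str
    event_time_s: int    # when the sensor took the reading
    arrival_time_s: int  # when the pipeline received it


def is_on_time(reading: Reading, allowed_lateness_s: int = ALLOWED_LATENESS_S) -> bool:
    """True if the reading arrived within the allowed lateness."""
    return reading.arrival_time_s - reading.event_time_s <= allowed_lateness_s


def split_by_lateness(readings: List[Reading]) -> Tuple[List[Reading], List[Reading]]:
    """Separate readings into those to process and those treated as too late."""
    on_time = [r for r in readings if is_on_time(r)]
    late = [r for r in readings if not is_on_time(r)]
    return on_time, late


if __name__ == "__main__":
    readings = [
        Reading("s1", event_time_s=0, arrival_time_s=10),    # 10 s late: accepted
        Reading("s2", event_time_s=0, arrival_time_s=300),   # exactly 5 min: accepted
        Reading("s3", event_time_s=0, arrival_time_s=301),   # past the cutoff
    ]
    on_time, late = split_by_lateness(readings)
    print([r.sensor_id for r in on_time], [r.sensor_id for r in late])
```

Whether late readings are discarded, logged, or routed to a separate correction process is a business decision that the lateness threshold alone does not settle.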
A common architecture pattern is to consume data asynchronously by having data producers write data to a Cloud Pub/Sub topic and then having data consumers read from that topic. This helps prevent data loss when consumers cannot keep up with producers. Asynchronous data consumption also enables higher degrees of parallelism, which promotes scalability.
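The asynchronous pattern can be illustrated in a single process with Python's standard queue and threading modules. In this sketch, a queue.Queue stands in for the Cloud Pub/Sub topic and its subscription, and worker threads stand in for consumer services. It is an analogy only and does not use the Pub/Sub client library; the function name and the uppercase transformation are placeholders.

```python
import queue
import threading
from typing import List


def run_pipeline(messages: List[str], num_consumers: int = 3) -> List[str]:
    """Publish messages to a buffer and process them with parallel consumers."""
    buffer: queue.Queue = queue.Queue()   # stands in for the topic
    results: List[str] = []
    lock = threading.Lock()

    def consumer() -> None:
        while True:
            msg = buffer.get()
            if msg is None:               # sentinel: no more messages
                buffer.task_done()
                break
            processed = msg.upper()       # placeholder transformation
            with lock:                    # protect the shared results list
                results.append(processed)
            buffer.task_done()

    workers = [threading.Thread(target=consumer) for _ in range(num_consumers)]
    for w in workers:
        w.start()

    # The producer writes without waiting for consumers to finish;
    # the buffer absorbs any difference in speed.
    for m in messages:
        buffer.put(m)
    for _ in workers:
        buffer.put(None)                  # one sentinel per consumer

    buffer.join()                         # wait until every item is processed
    for w in workers:
        w.join()
    return results


if __name__ == "__main__":
    print(sorted(run_pipeline(["speed", "fuel", "engine-temp"])))
```

Raising num_consumers increases parallelism without changing the producer, which is the decoupling the pattern is meant to provide. Output order is not guaranteed, so the example sorts the results before printing.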
Business requirements help shape the context for systems integration and data management. They also impose constraints on acceptable solutions. In the context of the exam, business requirements may imply technical requirements that can help you identify the correct answer to a question.
Compliance and Regulation
Businesses and organizations may be subject to regulations. For example, it is likely that Mountkirk Games accepts payment using credit cards and so is subject to financial services regulations governing payment cards. Part of analyzing business requirements is to understand