Data Engineer Assignment

Question 1:
Scenario: You are working on a real-time data processing project for a social media platform. The platform generates a massive stream of user activity data, including posts, likes, and comments. As a data engineer, you need to design a data pipeline that can handle this continuous stream of data efficiently.
Question: How would you utilize Google's Pub/Sub in this scenario to process and analyze the real-time user activity data from the social media platform? Explain the key components and steps involved in setting up the data pipeline using Pub/Sub.

Question 2:
Scenario: You are part of a team responsible for building and managing data pipelines for a multinational e-commerce company. The company has a diverse range of data sources, including transactional databases, customer reviews, and website clickstream data. The team is looking for a solution to streamline the data integration and transformation processes.
Question: Explain how Google's Data Fusion can be leveraged in this scenario to simplify the building and management of data pipelines. Discuss the key features and advantages of using Data Fusion in an e-commerce data engineering workflow.

Question 3 (Bonus Question):
Scenario: You are working for a transportation company that operates a large fleet of vehicles. The company collects extensive log data from its vehicles, including GPS coordinates, engine diagnostics, and fuel consumption. The management team wants to optimize the operational efficiency of the fleet and identify potential maintenance issues proactively.
Question: Describe the importance of log processing in this transportation company's data engineering workflow. Explain how log processing can be used to monitor, debug, and optimize the data pipelines that handle the vehicle log data. Provide an example of how log processing can help in identifying maintenance issues and improving operational efficiency.
Please note that the scenario descriptions are provided to add context to the questions. Feel free to modify or elaborate on the scenarios based on your specific requirements or preferences.

Assignment: Visualizing BigQuery Data in Looker or Power BI

Task: Using either Looker or Power BI, create a visualization dashboard that displays meaningful insights from a dataset stored in Google BigQuery. The dataset should contain relevant information that can be visualized effectively to derive valuable business insights.

Instructions:
1) Select a dataset from Google BigQuery that aligns with the assigned task.
2) Determine the key metrics or dimensions that are relevant to the dataset and could provide valuable insights.
3) Create an account in Looker or Power BI if you don't have one already.
4) Connect Looker or Power BI to Google BigQuery and load the selected dataset.
5) Design a dashboard in Looker or Power BI that includes at least three visualizations showcasing different aspects of the data.
6) Ensure that the visualizations are visually appealing, clearly labeled, and provide meaningful insights.
7) Annotate the visualizations with explanatory notes, highlighting the key findings or trends.
8) Export the dashboard as a shareable link or document for evaluation.

Evaluation Criteria: Your assignment will be evaluated based on the following criteria:
1) Understanding of data visualization principles and best practices.
2) Effectiveness and relevance of chosen visualizations in conveying insights.
3) Clarity and conciseness of explanatory notes accompanying the visualizations.
4) Overall design aesthetics and user-friendliness of the dashboard.
5) Creativity and originality in presenting the data.

Submission:
Form URL: https://forms.office.com/r/hB3eXea8Kp
Submit your assignment by providing the shareable link or document containing the visualization dashboard in Looker or Power BI. Additionally, include any necessary credentials or access permissions to view the dashboard.
There are three expected deliverables from candidates:
1) Create a document in Google Docs or any other suitable tool, and attach a shareable, unprotected URL in Field 4 of the form. All submissions will go through a plagiarism check, and candidates should avoid using generative AI. We prioritize concise and straightforward responses that focus on the main content.
2) Provide a live link to the dashboard graphs or, if the dashboard is not deployed, provide PDF snapshots. Enter the link or upload the snapshots in Field 5 of the form.
3) Record a screen capture with a voice-over explaining the dashboard you have created. Upload the video to any drive, or to YouTube as a private video, and submit the link in Field 6 of the form.

Note: Ensure that any sensitive or confidential information is removed or anonymized from the dataset before submission.

In case of any queries, please contact the following:
1) Hema: hema@vigaet.com
2) Chinmay: chinmay.p@vigaet.com
3) Siddesh: siddesh@vigaet.com

FAQ

Which dataset should I use in BigQuery?
Ans: Use any built-in dataset provided by GCP, or populate a custom one.

What if we don't have access to GCP credits?
Ans: If you don't have GCP credits, answer the theoretical questions as well as you can. If you cannot use Google's Looker, use Power BI to create the graphs.

Which visualization tool should I use, Looker or Power BI?
Ans: Candidates can select either one. Those who don't have GCP free credits can submit work done in Power BI, but Looker is preferred.

How do I submit the assignment?
Ans: Fill in the form linked in this document to submit your work.