I am about halfway through the final week of content for the Data Engineering Zoomcamp by Data Talks Club. This is a free course, taught live during the early part of the year. It isn't being taught right now, so there is no active cohort of live students, but all of the videos and GitHub repositories from the latest cohort are still available.
Week One - Basics and Setup
- Introduction to Docker & Docker-Compose
- Ingestion of the Sample Data Set
- Setting up pgAdmin and Postgres
- SQL Refresher
Week Two - Workflow Orchestration
- Workflow Orchestration
- Installation and setup of Prefect
- Setting up ETL flows using Prefect
- Google Cloud Storage
- Parameterizing Flows & Deployments
- Working with Schedules and Infrastructure in Prefect
- Setting up Prefect Cloud
Week Three - Data Warehousing
- Data Warehousing with BigQuery
- Partitioning and Clustering with BigQuery
- BigQuery Best Practices
- BigQuery Internals
- Setting up and using ML models in BigQuery
Week Four - Analytics Engineering
- What is dbt
- Setting up a dbt Project with BigQuery
- Using Postgres with dbt
- Development of dbt models
- Testing and Documenting dbt models
- Deploying a dbt project
- Visualizing Transformed Data
Week Five - Batch Processing
- Introduction to Batch Processing
- Introduction to Spark
- Installation of Spark on Windows, MacOS, and Linux
- Intro to PySpark
- Intro to Spark Dataframes
- Intro to Spark SQL
- Spark Internals
- Intro to RDDs
- Using Spark with Google Cloud Storage
- Using Spark with Google Dataproc
- Connecting Spark with Dataproc to BigQuery
Week Six - Stream Processing
- Stream Processing
- Introduction to Kafka
- Kafka Configuration
- Kafka Streams Basics
- Faust - Python Stream Processing
- PySpark - Structured Streaming
- Kafka Streams with JVM Library
- KSQL and ksqlDB
- Kafka Connect
- Kafka with Docker
The focus is on open-source tools like Docker, Docker Compose, Postgres, Apache Spark, Python, Kafka, and others. These tools are among the most popular in the data field right now and are precisely what a course like this should teach. Additionally, the course introduces you to a lot of these topics quickly, so you get a good feel for how they all fit together. The course also spends a good deal of time on hands-on examples rather than just presenting slides of information for the student to remember. For my learning style, and I think that of many others, I don't truly learn a topic unless I can work through the exercises and implement the concepts on my keyboard. This course is focused heavily on the student implementing the solution along with the instructor.
I have also enjoyed using the Google Cloud Platform, which I haven't spent much time with. My certifications are in AWS, and I have used that quite a bit - but my exposure to GCP is limited. This has pushed me outside my comfort zone, and while I don't know if I will pursue more courses in GCP or make long-term plans involving that platform, I enjoyed using it and comparing it with AWS.
Many of the software versions are already out of date or don't work with current systems. I followed along using the exact versions of Python, Pandas, Docker, Prefect, etc., and it gave me nothing but headaches from library incompatibilities. If I had started by installing the latest versions of the software mentioned in the videos, I would have been in far better shape along the way. I realize that no one can predict what will break in the future - but the issues I had getting all of the software to work together were really frustrating.
I also found the content to be fairly basic - there was not a lot of depth in the instruction for most topics, in my opinion. It was enough to whet the appetite for further study in any of these topics, but not enough to say you know the subject. To some degree, this is true of any course, but I found that the weekly segments were too short and too narrowly focused to cover the topics adequately.
The dataset used throughout the course did not match what was shown in the videos or the GitHub repository, making it hard to find and ingest. The scripts provided to ingest the data and clean up data types didn't work as expected, so I ended up developing my own scripts to handle these subtle differences. Overall this could be seen as positive. Needing to do this and troubleshoot software installations and configurations throughout the course taught me much about how these systems and tools work - so this wasn't all bad. But I would have preferred to focus on using the tools and the data rather than spending so much time fixing the data. I get that this is part of the Data Engineer's role, but for training, I would rather focus on the topics than on this kind of growth by troubleshooting.
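As an illustration of the kind of type cleanup described above, here is a minimal sketch. It is not the course's script: the column names, the `coerce` and `ingest` helpers, and the use of SQLite as a stand-in for Postgres are all assumptions made to keep the example self-contained.

```python
import csv
import io
import sqlite3
from datetime import datetime

# Hypothetical column names and target types; the real dataset's schema may differ.
COLUMN_TYPES = {
    "pickup_datetime": "datetime",
    "passenger_count": "int",
    "trip_distance": "float",
}


def coerce(value, kind):
    """Convert a raw CSV string to the target type; blanks and bad values become None."""
    value = value.strip()
    if value == "":
        return None
    try:
        if kind == "int":
            # Some exports write integers as "1.0", so go through float first.
            return int(float(value))
        if kind == "float":
            return float(value)
        if kind == "datetime":
            # Normalize to ISO-style text, which sorts correctly as a string.
            parsed = datetime.strptime(value, "%Y-%m-%d %H:%M:%S")
            return parsed.isoformat(sep=" ")
    except ValueError:
        return None
    return value


def ingest(csv_file, conn, table="trips", chunk_size=1000):
    """Read rows from csv_file, coerce types, insert in chunks, and return the row count."""
    columns = list(COLUMN_TYPES)
    conn.execute(f"CREATE TABLE IF NOT EXISTS {table} ({', '.join(columns)})")
    placeholders = ", ".join("?" for _ in columns)
    insert_sql = f"INSERT INTO {table} VALUES ({placeholders})"
    total = 0
    chunk = []
    for row in csv.DictReader(csv_file):
        # Missing fields come back as None from DictReader, so fall back to "".
        chunk.append(
            tuple(coerce(row.get(c) or "", COLUMN_TYPES[c]) for c in columns)
        )
        if len(chunk) >= chunk_size:
            conn.executemany(insert_sql, chunk)
            total += len(chunk)
            chunk = []
    if chunk:
        conn.executemany(insert_sql, chunk)
        total += len(chunk)
    conn.commit()
    return total


if __name__ == "__main__":
    # Tiny in-memory sample: one clean row and one row with unusable values.
    sample = io.StringIO(
        "pickup_datetime,passenger_count,trip_distance\n"
        "2021-01-01 00:30:10,1.0,2.10\n"
        "not a date,,abc\n"
    )
    conn = sqlite3.connect(":memory:")
    print(ingest(sample, conn), "rows loaded")
    print(conn.execute("SELECT * FROM trips").fetchall())
```

Turning unparseable values into `None` instead of raising keeps a long ingestion run from failing on one bad row, at the cost of hiding how many values were dropped. A real script would probably want to count or log them.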
I'm almost done, and overall I am happy I spent the time going through this course. While distracting, the issues I mention above don't take away from the fact that this course is an excellent introduction to many topics related to Data Engineering. I will need to follow up on this course with further study, in my case focused on Spark and Kafka, but it gave me a good introduction to both. The other areas were good reviews or introductions to areas I don't plan to pursue heavily now. The bottom line is that if you are looking for a good intro data engineering course, this might be a nice one to start with.
I plan to finish the week six content on Kafka. I haven't decided yet if I'm going to do the project that is part of the course when it is taught live. The live course's final project is peer-graded and awards a certificate of completion. Whether I attempt it will depend on two factors:
- Whether I can find something interesting to do
- My desire to move on to a more in-depth study of Spark first
If I do the final project for this course, I will post more about it here.