Data engineer interview questions github


Data Engineering Cookbook


If You Like This Book & Need More Help

Check out my Data Engineering Academy and personal Coaching at

Visit Here

  • New content every week!
  • Step by step course from researching job postings, creating and doing your project to job application tips
  • Full AWS Data Engineering example project (Azure in development)
  • 1+ hours Ultimate Introduction to Data Engineering course
  • Data Engineering Fundamentals course
  • Data Platform & Pipeline Design course
  • Apache Spark Fundamentals course
  • Choosing Data Stores Course
  • Private Member Slack Workspace (lifetime access)
  • Weekly Q&A live stream & Archive
  • Currently over 24 hours of videos

Support This Book For Free!

  • Amazon: Click here and buy whatever you like from Amazon using this link* (also check out my complete podcast gear and books)


Basic Engineering Skills

Advanced Engineering Skills

Hands On Course

Case Studies

Best Practices Cloud Platforms

130+ Free Data Sources For Data Science

1001 Interview Questions

Recommended Books and Courses

How To Contribute

If you have some cool links or topics for the cookbook, please become a contributor.

Simply pull the repo, add your ideas and create a pull request. You can also open an issue and put your thoughts there.

Please use the "Issues" function for comments.


Everything is free, but please support what you like! Join my Patreon and become a plumber yourself: Link to my Patreon

Or support me and send a message I read on the next livestream through Link to my

Important Links

Subscribe to my Plumbers of Data Science YouTube channel for regular updates: Link to YouTube

Check out my blog and get updated via mail by joining my mailing list:

I have a Medium publication where you can publish your data engineer articles to reach more people: Medium publication

*(As an Amazon Associate I earn from qualifying purchases from Amazon. This is free of charge for you, but super helpful for supporting this channel.)

Coding Interview University

I originally created this as a short to-do list of study topics for becoming a software engineer, but it grew to the large list you see today. After going through this study plan, I got hired as a Software Development Engineer at Amazon! You probably won't have to study as much as I did. Anyway, everything you need is here.

I studied about 8-12 hours a day, for several months. This is my story: Why I studied full-time for 8 months for a Google interview

Please Note: You won't need to study as much as I did. I wasted a lot of time on things I didn't need to know. More info about that below. I'll help you get there without wasting your precious time.

The items listed here will prepare you well for a technical interview at just about any software company, including the giants: Amazon, Facebook, Google, and Microsoft.

Best of luck to you!

What is it?

Coding at the whiteboard - from HBO's Silicon Valley

This is my multi-month study plan for becoming a software engineer for a large company.


  • A little experience with coding (variables, loops, methods/functions, etc)
  • Patience
  • Time

Note this is a study plan for software engineering, not web development. Large software companies like Google, Amazon, Facebook and Microsoft view software engineering as different from web development. For example, Amazon has Frontend Engineers (FEE) and Software Development Engineers (SDE). These are 2 separate roles and the interviews for them will not be the same, as each has its own competencies. These companies require computer science knowledge for software development/engineering roles.

Table of Contents

The Study Plan

Topics of Study

Getting the Job

---------------- Everything below this point is optional ----------------

Optional Extra Topics & Resources

Why use it?

If you want to work as a software engineer for a large company, these are the things you have to know.

If you missed out on getting a degree in computer science, like I did, this will catch you up and save four years of your life.

When I started this project, I didn't know a stack from a heap, didn't know Big-O anything, or anything about trees, or how to traverse a graph. If I had to code a sorting algorithm, I can tell ya it would have been terrible. Every data structure I had ever used was built into the language, and I didn't know how they worked under the hood at all. I never had to manage memory unless a process I was running would give an "out of memory" error, and then I'd have to find a workaround. I used a few multidimensional arrays in my life and thousands of associative arrays, but I never created data structures from scratch.

It's a long plan. It may take you months. If you are familiar with a lot of this already it will take you a lot less time.

How to use it

Everything below is an outline, and you should tackle the items in order from top to bottom.

I'm using GitHub's special markdown flavor, including task lists to track progress.

Create a new branch so you can check items like this, just put an x in the brackets: [x]

Fork the GitHub repo by clicking on the Fork button.

Clone to your local repo:

Mark all boxes with X after you completed your changes:

More about GitHub-flavored markdown

Don't feel you aren't smart enough

A Note About Video Resources

Some videos are available only by enrolling in a Coursera or EdX class. These are called MOOCs. Sometimes the classes are not in session, so you may have to wait a couple of months before you get access.

It would be great to replace the online course resources with free and always-available public sources, such as YouTube videos (preferably university lectures), so that people can study these anytime, not just when a specific online course is in session.

Choose a Programming Language

You'll need to choose a programming language for the coding interviews you do, but you'll also need to find a language that you can use to study computer science concepts.

Preferably the language would be the same, so that you only need to be proficient in one.

For this Study Plan

When I did the study plan, I used 2 languages for most of it: C and Python

  • C: Very low level. Allows you to deal with pointers and memory allocation/deallocation, so you feel the data structures and algorithms in your bones. In higher level languages like Python or Java, these are hidden from you. In day to day work, that's terrific, but when you're learning how these low-level data structures are built, it's great to feel close to the metal.
    • C is everywhere. You'll see examples in books, lectures, videos, everywhere while you're studying.
    • The C Programming Language, 2nd Edition
      • This is a short book, but it will give you a great handle on the C language and if you practice it a little you'll quickly get proficient. Understanding C helps you understand how programs and memory work.
      • You don't need to go super deep in the book (or even finish it). Just get to where you're comfortable reading and writing in C.
      • Answers to questions in the book
  • Python: Modern and very expressive, I learned it because it's just super useful and also allows me to write less code in an interview.

This is my preference. You do what you like, of course.

You may not need it, but here are some sites for learning a new language:

For your Coding Interview

You can use a language you are comfortable in to do the coding part of the interview, but for large companies, these are solid choices:

You could also use these, but read around first. There may be caveats:

Here is an article I wrote about choosing a language for the interview: Pick One Language for the Coding Interview. This is the original article my post was based on:

You need to be very comfortable in the language and be knowledgeable.

Read more about choices:

See language-specific resources here

Books for Data Structures and Algorithms

This book will form your foundation for computer science.

Just choose one, in a language that you will be comfortable with. You'll be doing a lot of reading and coding.




Your choice:


Your choice:

Interview Prep Books

You don't need to buy a bunch of these. Honestly "Cracking the Coding Interview" is probably enough, but I bought more to give myself more practice. But I always do too much.

I bought both of these. They gave me plenty of practice.

If you have tons of extra time:

Choose one:

Don't Make My Mistakes

This list grew over many months, and yes, it got out of hand.

Here are some mistakes I made so you'll have a better experience. And you'll save months of time.

1. You Won't Remember it All

I watched hours of videos and took copious notes, and months later there was much I didn't remember. I spent 3 days going through my notes and making flashcards, so I could review. I didn't need all of that knowledge.

Please, read so you won't make my mistakes:

Retaining Computer Science Knowledge.

2. Use Flashcards

To solve the problem, I made a little flashcards site where I could add flashcards of 2 types: general and code. Each card has different formatting. I made a mobile-first website, so I could review on my phone or tablet, wherever I am.

Make your own for free:

I DON'T RECOMMEND using my flashcards. There are too many and many of them are trivia that you don't need.

But if you don't want to listen to me, here you go:

Keep in mind I went overboard and have cards covering everything from assembly language and Python trivia to machine learning and statistics. It's way too much for what's required.

Note on flashcards:


Data Engineering Interviews


An open-source, community-driven knowledge-sharing project whose objectives are to

  • highlight main topics data engineers are dealing with on daily basis
  • simplify knowledge sharing and knowledge discovery for data professionals
  • facilitate interview preparation for data professionals
  • help aspiring data engineers and scientists to break into industry

Find full list of questions here.


It is a fully community-driven project, so your contribution matters:

  • If you know a question you would like to share — please create a PR
  • If you know how to answer a question — please create a PR with the answer
  • If you think you can improve an answer — please create a PR with improvement suggestion
  • If you see a mistake — please create a PR and propose a fix

For updates, join our slack workspace and follow me on LinkedIn (dkisler).

Respect your peers and follow our code of conduct

List of contributors

The Last Data Science/Data Engineer Interview Prep Video You Will Ever Need To Watch

1001 Data Engineering Interview Questions

Looking for a job or just want to know what people find important? In this chapter you can find a lot of interview questions we collect on the stream.

Ultimately this should reach at least one thousand and one questions.

But Andreas, where are the answers?? Answers are for losers. I have been thinking a lot about this and the best way for you to prepare and learn is to look into these questions yourself.

This cookbook or Google will help you a long way. Some questions we discuss directly on the live stream.

Live Streams

First live stream where we started to collect these questions.

Podcast Episode: #096 1001 Data Engineering Interview Questions
First live stream where we collect and try to answer as many interview questions as possible. If this helps people and is fun, we will do this regularly until we reach one thousand and one.
Watch on YouTube

All Interview Questions

The interview questions are roughly structured like the sections in the "Basic data engineering skills" part. This makes it easier to navigate this document. I still need to sort them accordingly.


  • What are windowing functions?

  • What is a stored procedure?

  • Why would you use them?

  • What are atomic attributes?

  • Explain the ACID properties of a database

  • How to optimize queries?

  • What are the different types of JOIN (CROSS, INNER, OUTER)?

  • What is the difference between Clustered Index and Non-Clustered Index - with examples?
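
For the windowing functions question above, here is one small sketch to experiment with, not an official answer. It assumes Python's built-in sqlite3 module linked against SQLite 3.25 or newer (the first release with window functions). The `orders` table and the `running_totals` helper are made up for illustration.

```python
import sqlite3


def running_totals(rows):
    """Return (customer, amount, running_total) rows using a SQL window function.

    `rows` is a list of (customer, amount) tuples; the table layout is hypothetical.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount INTEGER)"
    )
    conn.executemany("INSERT INTO orders (customer, amount) VALUES (?, ?)", rows)
    # SUM(...) OVER (...) keeps every input row and adds a per-customer running sum,
    # unlike GROUP BY, which would collapse each customer to a single row.
    query = """
        SELECT customer,
               amount,
               SUM(amount) OVER (PARTITION BY customer ORDER BY id) AS running_total
        FROM orders
        ORDER BY customer, id
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result
```

Calling `running_totals([("ann", 10), ("bob", 5), ("ann", 20)])` keeps all three rows and adds a running sum per customer, which is the main contrast with an aggregate plus GROUP BY.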

The Cloud

  • What is serverless?

  • What is the difference between IaaS, PaaS and SaaS?

  • How do you move from the ingest layer to the consumption layer? (In serverless)

  • What is edge computing?

  • What is the difference between cloud and edge and on-premise?


Big Data

  • What are the 4 V's?

  • Which one is most important?


  • What is a topic?

  • How to ensure FIFO?

  • How do you know if all messages in a topic have been fully consumed?

  • What are brokers?

  • What are consumer groups?

  • What is a producer?


  • What is the difference between an object and a class?

  • Explain immutability

  • What are AWS Lambda functions and why would you use them?

  • Difference between library, framework and package

  • How to reverse a linked list

  • Difference between args and kwargs

  • Difference between OOP and functional programming
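
The "reverse a linked list" question is a classic whiteboard exercise. Below is a minimal iterative sketch, assuming a singly linked list. The `Node` class and the `from_list` and `to_list` helpers are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    value: int
    next: Optional["Node"] = None


def from_list(values):
    """Build a linked list from a Python list and return its head (or None)."""
    head = None
    for value in reversed(values):
        head = Node(value, head)
    return head


def to_list(head):
    """Collect the node values into a Python list, front to back."""
    out = []
    while head is not None:
        out.append(head.value)
        head = head.next
    return out


def reverse(head):
    """Reverse the list in place and return the new head: O(n) time, O(1) extra space."""
    previous = None
    current = head
    while current is not None:
        following = current.next   # remember the rest of the list
        current.next = previous    # point this node backwards
        previous = current         # advance both pointers
        current = following
    return previous
```

A recursive version is also common in interviews, but the iterative one avoids hitting Python's recursion limit on long lists.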


  • What is a key-value (rowstore) store?

  • What is a columnstore?

  • Diff between Row and

  • What is a document store?

  • Difference between Redshift and Snowflake


  • What file formats can you use in Hadoop?

  • What is the difference between a namenode and a datanode?

  • What is HDFS?

  • What is the purpose of YARN?

Lambda Architecture

  • What is streaming and batching?

  • What is the upside of streaming vs batching?

  • What is the difference between lambda and kappa architecture?

  • Can you sync the batch and streaming layer and if yes how?


  • Difference between lists, tuples and dictionaries

Data Warehouse & Data Lake

  • What is a data lake?

  • What is a data warehouse?

  • Are there data lake warehouses?

  • Two data lakes within single warehouse?

  • What is a data mart?

  • What is a slowly changing dimension (types)?

  • What is a surrogate key and why use them?


  • What does REST mean?

  • What is idempotency?

  • What are common REST API frameworks (Jersey and Spring)?
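
To make the idempotency question concrete, here is a rough in-memory sketch of one common approach: the client sends an idempotency key and the server replays the stored result for repeated keys. The `PaymentService` class and its `charge` method are invented for illustration and are not taken from any real API.

```python
class PaymentService:
    """Toy service showing an idempotent operation via an idempotency key."""

    def __init__(self):
        self._processed = {}  # idempotency_key -> stored result
        self.balance = 0

    def charge(self, idempotency_key, amount):
        # A retry with the same key returns the stored result
        # instead of applying the side effect a second time.
        if idempotency_key in self._processed:
            return self._processed[idempotency_key]
        self.balance += amount
        result = {"charged": amount, "balance": self.balance}
        self._processed[idempotency_key] = result
        return result
```

Calling `charge("k1", 10)` twice changes the balance only once, which is what makes a client retry after a timeout safe.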

Apache Spark

  • What is an RDD?

  • What is a dataframe?

  • What is a dataset?

  • How is a dataset typesafe?

  • What is Parquet?

  • What is Avro?

  • Difference between Parquet and Avro

  • Tumbling Windows vs. Sliding Windows

  • Difference between batch and stream processing

  • What are microbatches?
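
For the tumbling vs. sliding windows question, a plain-Python sketch can show the difference without a Spark cluster. It assumes events are (timestamp, value) pairs with non-negative integer timestamps and that windows are half-open intervals [start, start + size). Both function names are hypothetical, and this is not how Spark implements windowing internally.

```python
def tumbling_windows(events, size):
    """Assign each event to exactly one non-overlapping window, keyed by window start."""
    windows = {}
    for ts, value in events:
        start = (ts // size) * size
        windows.setdefault(start, []).append(value)
    return dict(sorted(windows.items()))


def sliding_windows(events, size, slide):
    """Assign each event to every overlapping window whose start is a multiple of `slide`.

    Windows that would start before timestamp 0 are ignored (an assumption of this sketch).
    """
    windows = {}
    for ts, value in events:
        start = (ts // slide) * slide  # latest window start at or before ts
        while start > ts - size and start >= 0:
            windows.setdefault(start, []).append(value)
            start -= slide
    return dict(sorted(windows.items()))
```

With `size=4` and `slide=2`, an event at timestamp 4 lands in the windows starting at 2 and at 4, while a tumbling window of the same size would hold it exactly once.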


  • What is a use case of MapReduce?

  • Write pseudocode for word count

  • What is a combiner?
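
One way to practice the word count question is to write the three phases out in plain Python. This is a single-process sketch of the map, shuffle and reduce steps, assuming whitespace-separated words. The function names are illustrative and do not come from Hadoop.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in the line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group all emitted values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(word, counts):
    """Sum the counts for one word."""
    return word, sum(counts)


def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(word, counts) for word, counts in shuffle(pairs).items())
```

A combiner would apply the same summing step to each mapper's local output before the shuffle, which reduces the amount of data sent over the network.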

Docker & Kubernetes

  • What is a container?

  • Difference between Docker Container and a Virtual PC

  • What is the easiest way to learn Kubernetes fast?

Data Pipelines

  • What is an example of a serverless pipeline?

  • What is the difference between at most once vs at least once vs exactly once?

  • What systems provide transactions?

  • What is an ETL pipeline?


  • What is a DAG (in the context of Airflow/Luigi)?

  • What are hooks/is a hook?

  • What are operators?

  • How to branch?
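
For the DAG question, it can help to see that a scheduler basically needs a topological order of the tasks. The sketch below uses the standard library's graphlib (Python 3.9 or newer is assumed). The task names are made up, and this is not Airflow's or Luigi's actual scheduling code.

```python
from graphlib import CycleError, TopologicalSorter


def execution_order(dependencies):
    """Return one valid run order for tasks.

    `dependencies` maps each task to the set of tasks that must finish before it.
    Raises ValueError if the graph has a cycle, because then it is not a DAG.
    """
    try:
        return list(TopologicalSorter(dependencies).static_order())
    except CycleError as exc:
        raise ValueError("dependencies contain a cycle") from exc
```

For example, `execution_order({"transform": {"extract"}, "load": {"transform"}})` puts `extract` first and `load` last, while a graph where two tasks depend on each other is rejected.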



  • What is Kerberos?

  • What is a firewall?

  • What is GDPR?

  • What is anonymization?

Distributed Systems

  • How do clusters reach consensus? (The answer was using consensus protocols like Paxos or Raft. Good thing I didn't have to explain Paxos.)

  • What is the CAP theorem / explain it (What factors should be considered when choosing a DB?)

  • How to choose right storage for different data consumers? It's always a tricky question

Apache Flink

  • What is Flink used for?

  • Flink vs Spark?


  • What are branches?

  • What are commits?

  • What's a pull request?


  • What is continuous integration?

  • What is continuous deployment?

  • Difference CI/CD

Development / Agile

  • What is Scrum?

  • What is OKR?

  • What is Jira and what is it used for?




There are tons of great resources all over the internet. I've bookmarked hundreds of URLs and this page is my categorized collection of the references and free tools I've found to be helpful. If you're reading this and have something to add or find a dead link please send me a note. I'm continuing to add to this over time.

Analytics/Data Visualization | API Services | Career Management | Cloud Computing | Computer Science/Programming | Data Engineering | DataOps | Data Science/Machine Learning | Datasets | Website Tools

Analytics/Data Visualization

Apache Superset

Apache Superset is a modern data exploration and visualization platform that is fast, lightweight, intuitive, and loaded with options.

Google Charts

Google Charts is a JavaScript-based tool that lets people easily create a chart from some data and embed it in a web page. It’s free and has a solid library of interactive charts and data tools available for use.

Google Data Studio

Free product lets you connect to all your marketing data and turn that data into beautiful, informative reports that are easy to understand, share, and fully customizable.


Hyperquery is a collaborative workspace for analytics. You can write SQL queries and consolidate your data context, automatically mapped from your favorite data tools such as Snowflake, Looker, Fivetran, Segment, and dbt. With Hyperquery, you can keep your documentation and business metadata in one place, so you can do your best work in analytics.


VisiData is an interactive multitool for tabular data. It combines the clarity of a spreadsheet, the efficiency of the terminal, and the power of Python, into a lightweight utility which can handle millions of rows with ease.

API Services



Add high performance files and documents conversion to your website, web, or desktop application.

Google Analytics Query Explorer

This tool lets you play with the Core Reporting API by building queries to get data from your Google Analytics views (profiles). You can use these queries with any of the client libraries to build your own tools.

GraphQL Introduction

GraphQL is a query language for your API, and a server-side runtime for executing queries by using a type system you define for your data. Learn about GraphQL, how it works, and how to use it in this series of articles.

How to design a RESTful API architecture from a human-language spec

A three-post series that teaches RESTful API design to solve users’ needs with simplicity, reliability, and performance.


API for accessing current weather data for any location including over 200,000 cities.


This free Chrome extension allows developers to explore, test, and build APIs using a powerful collaborative testing and development suite.

Public APIs

Categorizes different APIs scoured from the web which make their resources available for public consumption.


Enables developers to find, test, and manage API integrations from one place and provides real-time performance metrics.

RESTful Architecture

Technical documentation for RESTful web services with references and language-specific examples.

What is a REST API?

Thorough overview on what REST APIs are and how to use them.

Career Management


3 Data Career Paths Decoded

Helpful article that compares and contrasts the role of a data analyst vs. data scientist vs. data engineer.

CodeFights - Practice for Technical Interviews

Technical interviews are tough. CodeFights, which is best known as a competitive coding and skill-based recruiting platform, helps developers practice for these interviews through a free platform that offers study topics and practice questions.

Coding Interview University

At a time where technology is outpacing the ability of many universities to update their course curriculum, many aspiring software engineers are seeking alternative forms of the education required to get a job. This is a popular complete computer science multi-month study plan to become a software engineer.

Data Flair Interview Questions

Frequently asked interview questions, by category. Each question is accompanied by answers shared by industry experts.

Developer Roadmap

This repository contains a set of charts demonstrating different paths to take and technologies to adopt in order to become a front-end, back-end, or dev-ops engineer. While it seems a bit overwhelming in the beginning, it is a useful guide for what’s possible and needed in this fast-changing industry. The repo gets updated every year to reflect changes in the ecosystem.


A Computer Science portal for geeks. It contains computer science and programming articles, quizzes, and practice/competitive programming questions. It also has a large database of company-specific interview questions.

Git Showcase

Free portfolio site that allows developers to easily feature projects from their GitHub repositories.

Google Cloud Certification - Data Engineer

A Google Certified Professional - Data Engineer enables data-driven decision making by collecting, transforming, and visualizing data. To earn this certification you must pass the in-person exam. This webpage offers a collection of useful training resources and reference materials aimed at achieving this certification.


HackerEarth is a network of top developers across the world. Developers participate in online coding challenges and hackathons, solve problems and discover the best jobs.


Join over 2 million developers in solving code challenges on HackerRank, one of the best ways to prepare for programming interviews.


At some point in your professional coding career, you’re going to feel stupid when you forgot some simple term. Hackterms is a crowdsourced dictionary of coding terms and serves as a sort of wiki for coding language. Programming is full of jargon and self-inflicted nomenclature wounds. Hackterms helps by returning plain-speak explanations for these.

IBM Certified Data Engineer - Big Data

This certification is intended for big data engineers. To attain IBM Certified Big Data Engineer status, candidates must pass one test.


Learn and Practice on almost all coding interview questions asked historically and get referred to the best tech companies.

Interview Cake

Free practice programming interview questions. Interview Cake helps you prep for interviews to land offers from your dream companies. is a platform where people can practice technical interviewing anonymously with engineers from top companies.

Learn To Code

Programming and computer science are becoming more popular than ever. As a result, there are an increasingly huge number of resources and tutorials being produced for beginners who want to learn to code, ranging from books to online tutorials to interactive websites to massive open online courses (MOOCS). This can be overwhelming for beginners – there are almost too many resources available, and it’s difficult to figure out where to start. This page offers a curated list of resources for both new developers and developers looking to advance their skills and learn a new language/framework.


Level up your coding skills and quickly land a job. This is a good place to expand your knowledge and get prepared for your next interview.

Mastering SQL Queries

Practice easy, medium, and hard SQL interview questions.


Practice mock interviews and coding questions online, with peers, for free. Great practice for preparing for interviews and ultimately landing your dream tech job.

Skills Index

Undoubtedly you’ve heard about the skills gap challenges in the U.S. economy. Using select data from LinkedIn and [email protected]’s proprietary analysis, the [email protected] Skills Index takes a look at supply vs. demand around specific skill sets across top industries and provides actionable recommendations for getting up to speed.


StackShare provides online software for displaying and sharing your technology stack, which is made up of the software that you use. It’s an online community that features comparisons, ratings, reviews, recommendations, and discussions of the best software tools and software infrastructure services.


For developers, this site offers public technical tests and practice interview questions. If you score well you can get free certificates to display on your online profiles.

Things You Need to Know in a Programming Interview

This article covers general tips on how you, the interviewee, can impress your interviewer during a coding session and land your dream job.

Cloud Computing


AWS Explained: The Basics

A nice introductory primer for getting started in AWS and cloud computing in general.

AWS Tutorial for Beginners

AWS (Amazon Web Service) is a cloud computing platform that enables users to access on demand computing services like database storage, virtual cloud server, etc. This online course will give an in-depth knowledge on EC2 instance as well as useful strategy on how to build and modify instance for your own applications.

Computer Science/Programming



A curated list of delightful VS Code packages and resources.

Beginner’s Resources to Learn Programming Languages

This blog post details some important programming languages and offers numerous links for learning more about each.


A curated collection of tutorials and free learning resources for learning to code in new languages.

Big-O Notation Cheat Sheet

This webpage covers the space and time Big-O complexities of common algorithms used in Computer Science.


Free site lets you share your code with others in CodeEnv online environments. Good for teaching, prototyping, and sharing fiddles.


Offers programming problems and lessons to challenge yourself and become a better coder.

Great collection of resources for exploring different careers in tech.

CS Algorithm Notes

Set of CS lecture notes, which you can use to teach yourself algorithms.

CS Playground React

A simple in-browser JavaScript sandbox for learning and practicing algorithms and data structures.

Offers sandbox environments for developers to play around with and modify live sample code for all kinds of languages. It’s also easy to share or demonstrate solutions to problems.

Git Branching Model

This post outlines a development model for git branching strategy and release management.

Git Cheat Sheet

Handy visual reference for commonly used git version control commands.

Gitignore: A Collection of .gitignore Templates

This repository is exactly what the name suggests: a collection of useful .gitignore templates. For every new project you set up as a GitHub repository, it becomes mandatory to have a .gitignore file to filter what gets uploaded. The repo contains templates for almost any language or framework.

Intro to Computer Science Terminology

A complex definition: Computer Science is the study of information technology, processes, and their interactions with the world.


Code faster in Python with intelligent snippets - Kite is a plugin for your IDE that uses machine learning to give you useful code completions for Python.

Learn Code The Hard Way

Learn Code The Hard Way courses are an effective system for learning the basics of computer programming, designed specifically for complete beginners.

Learn to Code From Home

Learning to code can be daunting, but you can do it at your own pace from the comfort of your own home. Thanks to dedicated programmers who have put time and energy into creating free online walkthroughs and guides to various programming languages, there are plenty of free resources right at your fingertips that offer hands-on activities and general overviews for beginner coding projects and advanced tasks.


A huge selection of cheat sheets for almost any current programming language and other technologies.

Project Euler

Project Euler is a series of challenging mathematical/computer programming problems that will require more than just mathematical insights to solve. Although mathematics will help you arrive at elegant and efficient methods, the use of a computer and programming skills will be required to solve most problems.

Python Challenge

Python Challenge is a game in which each level can be solved by a bit of (Python) programming. It’s a good way to practice through solving riddles.

Python Cheat Sheet

A single cheatsheet for all basic Python functions.

Python Datetime Conversion Cheat Sheet

Python cheat sheet for converting values using the datetime module.

Python Tutor

This is a free tool that helps people overcome a fundamental barrier to learning programming: understanding what happens as the computer runs each line of source code. With it, you can write Python, Java, JavaScript, TypeScript, Ruby, C, and C++ code in your web browser and visualize what the computer is doing step-by-step as it runs your code.

Real Python: Python Tutorials

Learn Python online: Python tutorials for developers of all skill levels, Python books and courses, Python news, code examples, articles, and more.

Rosetta Code Programming Tasks

Offers over 800 problems that can be solved through programming in different languages. Great for practice.

Software Literacy: Programming Learning Guide

Programmers have to know how to work within systems and networks, using different programming languages to create and adapt software that helps their employers get things done. This post offers helpful links to online tutorials and tools for learning how to code in some of the most popular programming languages.

Teach Yourself Computer Science

If you’re a self-taught engineer or bootcamp grad, you owe it to yourself to learn computer science. This guide offers the nine subjects you should learn with the best book or video lecture series for each subject. Ideally this list can be revisited throughout your career.


Topcoder is a company that administers contests in computer programming, through which prize money can be won. Competition aside, this site also offers regular challenges and matches through which you can learn new skills and hone skills you already have.

Data Engineering


5 Data Engineering Projects To Add To Your Resume

One of the best ways to develop and refine data engineering skills is through real-world portfolio projects. In this article, SeattleDataGuy reviews five potential project ideas with accompanying data sources.

7 Steps to Understanding NoSQL Databases

The term NoSQL has come to be synonymous with schema-less, non-relational data storage schemes. NoSQL is an umbrella term, one which encompasses a number of different technologies. This article gives newcomers an overview of the technologies and architectures it includes.

Airflow for Beginners

Introduced by Airbnb, Airflow is a platform to schedule and monitor data pipelines. This article overviews the basic setup to run Airflow including an example use-case.

Around Data Engineering

This actively-maintained GitHub repository details the never-ending journey of learning around data engineering and machine learning.

Awesome Data Engineering

A curated list of data engineering tools for software developers.

Beginner’s Guide to Big Data Terminology

Walkthrough on some of the common lingo of big data, such as DaaS and Neural Networking.

Complete Data Engineer’s Vocabulary

Comprehensive A-Z list of different data engineering concepts and technologies with brief summaries and embedded links to learn more information.


Conduktor is the ultimate Apache Kafka desktop client for performing regular Kafka administration and development tasks.

Data Engineering Cookbook

The Data Engineering Cookbook (124 pages) - Mastering The Plumbing Of Data Science - Andreas Kretz.

Data Engineering Open Source Coursework

UC Berkeley published its spring 2021 data engineering course slides and resources. It is excellent learning material for data engineering practitioners.


Free multi-platform database tool for developers, SQL programmers, database administrators, and analysts. Supports all popular databases: MySQL, PostgreSQL, MariaDB, SQLite, Oracle, DB2, SQL Server, Sybase, MS Access, Teradata, Firebird, Derby, etc.

How to Become a Data Engineer

This article outlines helpful considerations if you’re interested in pursuing a career as a data engineer such as required skills, typical responsibilities, and career outlook.

Introduction to Apache Airflow

Apache Airflow is hot right now, and if you’re interested in learning about it at a high level, this is a nice ‘intro’ guide.

Kafka for Beginners

Apache Kafka is used for enabling communication between producers and consumers using message-based topics. Apache Kafka is a fast, scalable, fault-tolerant, publish-subscribe messaging system.

OLAP Cubes

A nice intro guide on what these are and why they are used.


Working with some messy address or name data? It helps to split each one into separate components. Parserator is a framework for making parsers using natural language processing (NLP) methods.


This page tries to collect the libraries for the queueing systems (job, messaging, etc.) that are widely popular and have a successful record of running on (big) production systems.

Self-Study List for Data Engineers and Aspiring Data Architects

With the explosion of “Big Data” over the last few years, the need for people who know how to build and manage data pipelines has grown. This article looks at the sought-after job skills for these roles and how you can go about learning them.

SQL Cheat Sheet

In this guide, you’ll find a useful cheat sheet that documents some of the more commonly used elements of SQL, and even a few of the less common.

SQL Query Optimization Techniques

Tips for writing faster and more efficient SQL queries

SQL vs NoSQL — What is better for you?

Important considerations when choosing a relational (SQL) or non-relational (NoSQL) data structure for your project architecture.

SqlDBM - SQL Database Modeler

SqlDBM offers you an easy, convenient way to design your database absolutely anywhere on any browser, working away without need for any extra database engine or database modelling tools or apps.

Stream Processing 101: From SQL to Streaming SQL in 10 Minutes

We have entered an era where competitive advantage comes from analyzing, understanding, and responding to an organization’s data. When doing this, time is of the essence, and speed will decide the winners and losers. This blog post introduces technologies we can use for stream processing.

Three Questions to Help You Prepare for a Data Engineering Interview

Data engineering requires a combination of knowledge, from data warehousing to programming, in order to ensure the data systems are designed well and are as automated as possible. The question is: How do you prepare for an interview for a data engineering position?

Transactional vs. Analytical Processing

Good cross-comparison between OLTP and OLAP systems.

What Does a Data Engineer / Data Architect Do?

This post explores the path of becoming a data engineer / big data architect.

What Is Big Data?

Analyzing lots of data is only part of what makes big data analytics different from previous data analytics. This article delves into what those other aspects are.

What Is ETL?

ETL is shorthand for the extraction, transformation, and loading process used in most data movement operations. This article provides a nice overview for those wanting to understand the basics around these phases.
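
To make the three phases concrete, here is a minimal sketch in Python using only the standard library. The sample CSV data, the `payments` table, and the function names are all hypothetical and chosen for illustration; a real pipeline would read from actual source systems and load into a real warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical raw input standing in for a source system export.
RAW_CSV = """name,signup_date,amount
 Alice ,2021-03-01,10.50
Bob,2021-03-02,
carol,2021-03-03,7.25
"""


def extract(text):
    """Extract: read raw rows from a CSV source."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: tidy names, drop rows with no amount, cast amounts to float."""
    cleaned = []
    for row in rows:
        if not row["amount"].strip():
            continue  # skip incomplete records
        cleaned.append(
            (row["name"].strip().title(), row["signup_date"], float(row["amount"]))
        )
    return cleaned


def load(rows, conn):
    """Load: write the cleaned rows into a target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS payments (name TEXT, signup_date TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO payments VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")  # in-memory database used as a stand-in target
load(transform(extract(RAW_CSV)), conn)
loaded = conn.execute("SELECT name, amount FROM payments ORDER BY name").fetchall()
print(loaded)
```

Keeping each phase in its own function means the transformation rules can be tested without touching the source or the target.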



DataOps Manifesto

Through firsthand experience working with data across organizations, tools, and industries, a group of professionals has uncovered a better way to develop and deliver analytics through an emerging practice called DataOps.

Dolt for Data Version Control

Dolt is the true Git for data experience in a SQL database, providing version control for schema and cell-wise for data, all optimized for collaboration. With Dolt, you can view a human-readable diff of the data you received last time versus the data you received this time. You can easily see updates you did not expect and fix the problem before you deploy the new data.

Monte Carlo for Data Observability

Monte Carlo is on a mission to accelerate the world’s adoption of data by minimizing data downtime. The platform brings full observability to data teams by monitoring, alerting, resolving, and preventing data quality issues, helping them achieve data reliability.

Data Science/Machine Learning


Best Practices for ML Engineering

This guide is intended to help those with a basic knowledge of machine learning get the benefit of best practices in machine learning. If you have taken a class in machine learning, built, or worked on a machine-learned model, then you have the necessary background to read this document.

Data Mining in Python: A Guide

Data mining is the process of discovering predictive information from the analysis of large databases. This guide provides an example-filled introduction to data mining using Python, one of the most widely used data mining tools - from cleaning and data organization to applying machine learning algorithms.

Efficient Python Tricks and Tools for Data Scientists

Interactive web book written by Khuyen Tran that teaches readers efficient methods of coding in Python to solve common problems in data science.

Foundations of Machine Learning

This training course offered through Bloomberg covers a wide variety of topics in machine learning and statistical modeling. The primary goal of the class is to help participants gain a deep understanding of the concepts, techniques and mathematical frameworks used by experts in machine learning. It is designed to make valuable machine learning skills more accessible to individuals with a strong math background, including software developers, experimental scientists, engineers, and financial professionals.

Handy Python Libraries for Formatting and Cleaning Data

Data scientists spend a lot of time cleaning messy data. This is a list of Python libraries that help make data more orderly and legible - from styling DataFrames to anonymizing datasets.


Offers a means of learning data science through both public and private competitions.


This website looks like its design hasn’t changed since the 90s, but it is home to lots of great content on business analytics, big data, data mining, and data science.

R or Python for Data Science?

This is a nice blog post that digs into the differences between R and Python, and the advantages of each, for data science tasks.


Regular expressions are extremely useful in extracting information from text such as code, log files, spreadsheets, or even documents. This site offers an interactive tutorial and practice exercises to help you learn them.
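
As a small illustration of extracting information from a log file, the sketch below uses Python's built-in `re` module to pull fields out of one web server log line. The sample line and the pattern are assumptions made for this example; real log formats vary and usually need a pattern tuned to them.

```python
import re

# Hypothetical access-log line in a common web server style.
LOG_LINE = '203.0.113.7 - - [12/Mar/2021:10:15:32 +0000] "GET /index.html HTTP/1.1" 200 5123'

# Named groups make the extracted fields self-describing.
LOG_PATTERN = re.compile(
    r'(?P<ip>\d{1,3}(?:\.\d{1,3}){3}) .*?'       # client IP, then skip to the bracket
    r'\[(?P<timestamp>[^\]]+)\] '                # timestamp inside [ ]
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" ' # request line inside quotes
    r'(?P<status>\d{3}) (?P<size>\d+)'           # status code and response size
)

match = LOG_PATTERN.match(LOG_LINE)
fields = match.groupdict() if match else {}
print(fields)
```

All captured values are strings, so numeric fields such as `status` and `size` still need converting with `int()` before any arithmetic.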


Open source R packages that allow access to data repositories and provide programmatic access to a variety of scientific data and other real-time metrics of scholarly impact.

TensorFlow Playground

TensorFlow is a machine learning library that underlies many of Google’s products. With this playground site, you can tinker with a neural network right in your browser.



Appen AI Resource Center

Collection of free, downloadable, and categorized datasets that have been created and curated for teams working on world-class AI applications.


This GitHub repo contains a list of high-quality, topic-centric public data sources. They are collected and tidied from blogs, answers, and user responses.

AWS Registry of Open Data

This registry exists to help people discover and share datasets that are publicly available via AWS resources.

Data.World: The Social Network for Data People

Discover and share cool data, connect with interesting people, and work together to solve problems faster. Users can find and use a vast array of high-quality open data.


Gapminder fights devastating misconceptions and promotes a fact-based worldview everyone can understand. Many of their datasets on how the world lives have been made publicly available for download.

Google Dataset Search

Free tool for searching over 25 million publicly available datasets. The search tool includes filters to limit results based on their license (free or paid), format (csv, images, etc), and update time. The results also include descriptions of the dataset’s contents as well as author citations.

Google’s My Activity Page

This portal reveals everything Google knows about you - every search you’ve made, the apps you’ve used, the videos you’ve watched, and everything in between. Visit to see how your data is being collected, modify activity settings, and delete data that you would prefer not be retained.

Website Tools



Enter the URL of a website and quickly find a list of the technologies used to support that site including email services, nameserver providers, JavaScript libraries, widgets, server information, and more.

Font Awesome

Font Awesome makes it easy to add vector icons and social logos to your website.

Google Design: Resizer

An interactive viewer to see and test how digital products respond to material design breakpoints across desktop, mobile, and tablet.

How To Use GitHub Pages To Make Websites

Step-by-step tutorial for getting started with building a website hosted on GitHub Pages.

How to Host Your Static Site with HTTPS on GitHub Pages and CloudFlare

GitHub offers free static website hosting and custom domain support, but when this article was written it was not possible to configure HTTPS for custom domains directly through GitHub Pages. This is where CloudFlare comes in.


Lighthouse is an open-source, automated tool for improving the quality of web pages. You can run it against any web page, public or requiring authentication. It has audits for performance, accessibility, progressive web apps, and more.

Mobile Website Speed Testing Tool

Another great Google product. Find out how well your site works across mobile and desktop devices by simply entering the URL.

Static Site Generators

A leaderboard of the top open-source static site generators based on GitHub stars.

Website Grader

Free online tool that grades any website against key metrics such as performance, mobile readiness, SEO, and security.

Who Is Hosting This

Allows a user to simply enter the domain name of any site and instantly uncover the identity of the company that is hosting the site.
