Who are you, data engineer?

In this article, I explain the data roles that exist today and, in particular, who a data engineer is: how the role is defined, what its responsibilities are, and which challenges it involves.

Photo: Christina on Unsplash @ wocintechchat.com

I’ve been working as a big data engineer for the past few years. The title sounds like a buzzword at the moment, and I’ve come to realize that many of my colleagues in the software world don’t really understand what the role involves.

Some people confuse it with DevOps, data analytics, or data science. Others see it as a rebranding of the veteran database administrator (DBA) role.

After explaining many times, to many different people, what I do and why it differs from the roles mentioned above, I realized that there are probably others out there who would love to know what data engineering is all about.

But first, what data roles exist today?

To be honest, the confusion is quite understandable. In a world where every basic action is translated into data and used by people, many companies have realized how important data is to their organization. Almost every company now has a data group, and each one defines the roles a little differently.

Sometimes the data group serves the entire company, usually in a small company with a specific domain. As the company grows, it becomes likely that each department will have a dedicated data group that masters the data flow of its own domain.

These are the key roles in a data group:

Data analysts: turn information into knowledge, identify trends, and use the analyzed data as a strategic engine for better data-based business decisions. Their main tools are databases, SQL and Hive queries, and graphical dashboards for data visualization.
Data scientists: use data-driven algorithms and machine learning to solve business problems. They often have extensive knowledge of statistics and mathematics, and they look for trends and patterns in the data to take the company’s interests to the next level.
Data engineers: build and maintain the data infrastructure, such as the data pipelines that move data from different sources to a place where other roles can use it, for example in preparation for data scientists to build models.

Photo credit: Author

Types of data engineers

Data engineers do not only “get” the data. They also give you easy access to it and keep collecting the latest data at any time, even in real time.

The classic “data engineer”: the data pipeline engineer

Most of the work consists of moving data from multiple sources to a single target. In many cases these engineers primarily use ETL tools, or build and maintain such tools themselves. ETL stands for Extract, Transform, and Load: extracting data from multiple sources, transforming it to fit business needs, and loading it into the target database.

This type of data engineer requires a strong understanding of relational databases, especially SQL queries.
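
To make the three ETL steps concrete, here is a minimal sketch using only the Python standard library. The two CSV sources, the `orders` table, and all function names are invented for illustration; a real pipeline would read from actual systems and would normally rely on a dedicated ETL tool or framework.

```python
import csv
import io
import sqlite3

# Hypothetical source data: two "sources" with slightly different conventions.
CRM_CSV = "id,name,amount\n1, Alice ,10.50\n2,Bob,3\n"
SHOP_CSV = "id,name,amount\n3,carol,7.25\n"


def extract(raw_csv):
    """Extract: parse one CSV source into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(raw_csv)))


def transform(rows, source):
    """Transform: normalize names and types, and tag each row with its source."""
    return [
        (int(r["id"]), r["name"].strip().title(), float(r["amount"]), source)
        for r in rows
    ]


def load(conn, records):
    """Load: write the cleaned records into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(id INTEGER PRIMARY KEY, name TEXT, amount REAL, source TEXT)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)", records)
    conn.commit()


def run_etl(conn):
    """Run extract, transform, and load for every source, into one target."""
    for source, raw in (("crm", CRM_CSV), ("shop", SHOP_CSV)):
        load(conn, transform(extract(raw), source))


if __name__ == "__main__":
    target = sqlite3.connect(":memory:")
    run_etl(target)
    for row in target.execute("SELECT * FROM orders ORDER BY id"):
        print(row)
```

Keeping the three steps as separate functions is a design choice for the sketch: each step can then be tested and replaced on its own.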

Machine learning data engineer

The primary role of this type of engineer is to deploy the model (developed by the data scientist) into the production environment, with everything that involves: setting up a production infrastructure that includes automation, testing, monitoring, and logging.

Machine learning engineers are also involved in writing the code that prepares the data and trains the models (the data preparation and training layer in big data solutions), for which a strong background in Python, Spark, and cloud environments is a must.

Key skills of a data engineer

Data scientists typically have a strong background in mathematics and statistics, while data engineers are typically software developers with several years of experience, knowledge of cloud infrastructure, and command of development languages such as Python, Java, or Scala.

Since we live in a world of big data, which is usually managed in the cloud, knowledge of at least one of the major cloud vendors is useful: Google Cloud, Azure, or AWS.

In addition, database knowledge is essential to the job: you need to understand relational and non-relational databases and be able to run complex queries to retrieve data, all without affecting the data used in a production environment.
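
One way to query data without risking changes to it is to open the database in read-only mode. The sketch below shows the idea with SQLite, which is only a stand-in here; the `events` table and the helper names are hypothetical, and other databases offer their own mechanisms such as read replicas or read-only roles.

```python
import os
import pathlib
import sqlite3
import tempfile


def build_demo_db(path):
    """Create a small stand-in for a production database (hypothetical data)."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE events (user_id INTEGER, action TEXT)")
    conn.executemany(
        "INSERT INTO events VALUES (?, ?)",
        [(1, "login"), (1, "purchase"), (2, "login")],
    )
    conn.commit()
    conn.close()


def open_read_only(path):
    """Open an SQLite database so that queries cannot modify it."""
    # The "mode=ro" URI parameter asks SQLite for a read-only connection.
    uri = pathlib.Path(path).resolve().as_uri() + "?mode=ro"
    return sqlite3.connect(uri, uri=True)


if __name__ == "__main__":
    db_path = os.path.join(tempfile.mkdtemp(), "prod.db")
    build_demo_db(db_path)
    ro = open_read_only(db_path)
    query = "SELECT action, COUNT(*) FROM events GROUP BY action ORDER BY action"
    print(ro.execute(query).fetchall())
    try:
        ro.execute("DELETE FROM events")  # any write attempt is rejected
    except sqlite3.OperationalError as exc:
        print("write rejected:", exc)
    ro.close()
```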

In some cases, depending on the project the engineer is working on, a basic understanding of machine learning algorithms, statistical models, and various mathematical functions is required.

Photo credit: Author

The challenges of being a data engineer

Reliability

The most important thing in the data world is data reliability: no model, however complex, can do anything useful if your data is corrupted. Data engineers collect data, sometimes from several sources, move it to a target, transform and process it into a uniform shape, and much more, so there are many points along the way where reliability can be compromised.

It’s a big challenge to make sure that we’re not changing the nature of the data along the way, that what we receive is the same as what we deliver.
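
A simple way to check that what was received matches what was delivered is to compare a row count and a content fingerprint on both sides. The following is a minimal sketch under that assumption; the record layout and function names are invented, and it only applies to steps that are supposed to move data unchanged.

```python
import hashlib
import json


def fingerprint(records):
    """Return (row count, order-independent digest) for a list of dict records."""
    row_hashes = sorted(
        hashlib.sha256(json.dumps(r, sort_keys=True).encode("utf-8")).hexdigest()
        for r in records
    )
    digest = hashlib.sha256("".join(row_hashes).encode("utf-8")).hexdigest()
    return len(records), digest


def same_data(received, delivered):
    """True if both sides hold the same records, regardless of order."""
    return fingerprint(received) == fingerprint(delivered)


if __name__ == "__main__":
    received = [{"id": 1, "value": 10.5}, {"id": 2, "value": 3.0}]
    reordered = [{"value": 3.0, "id": 2}, {"id": 1, "value": 10.5}]
    altered = [{"id": 1, "value": 10.5}, {"id": 2, "value": 30.0}]
    print(same_data(received, reordered))  # True: same rows, different order
    print(same_data(received, altered))    # False: one value changed in transit
```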

To provide a high degree of certainty, we must take actions along the way, for example:

Data consistency: each variable has a single meaning across the entire data set. To ensure data reliability, we must verify schema consistency, so that every record under a particular schema is treated the same way.
Metadata repository: provides context for the data by keeping organized metadata about where the data comes from and how it was processed.
Data modification permissions: only those authorized to modify data can do so, whether a person or a process. This ensures that no unexpected changes occur.
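
As an illustration of the schema consistency check mentioned in the list above, here is a minimal sketch that validates every record against one expected schema and separates out the records that do not match. The schema, the field names, and the helper names are assumptions made for the example.

```python
# Hypothetical schema for sensor readings: field name -> required Python type.
EXPECTED_SCHEMA = {"sensor_id": str, "reading": float, "unit": str}


def schema_errors(record, schema=EXPECTED_SCHEMA):
    """Return a list of human-readable schema violations (empty if valid)."""
    errors = []
    for field, expected in schema.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], expected):
            actual = type(record[field]).__name__
            errors.append(f"{field}: expected {expected.__name__}, got {actual}")
    for field in sorted(record.keys() - schema.keys()):
        errors.append(f"unexpected field: {field}")
    return errors


def split_by_validity(records):
    """Separate records that match the schema from those that do not."""
    valid, rejected = [], []
    for record in records:
        errors = schema_errors(record)
        if errors:
            rejected.append((record, errors))
        else:
            valid.append(record)
    return valid, rejected


if __name__ == "__main__":
    batch = [
        {"sensor_id": "t-1", "reading": 21.5, "unit": "C"},
        {"sensor_id": "t-2", "reading": "21.5", "unit": "C"},  # wrong type
    ]
    valid, rejected = split_by_validity(batch)
    print(len(valid), "valid;", rejected)
```

Rejected records are kept together with their errors rather than silently dropped, so that someone can inspect them later.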

Scalability and performance analysis

Sometimes, the amount and speed of incoming data can be unpredictable, and one of the challenges of this role is to build a system that knows how to handle the increased load easily and quickly.

It is important to understand that there is no magic solution for scaling. The right solution depends on the problem: where does the load hurt, and how can you handle it there? For example, if your system is a web API, load may affect response time, so the solution should address that layer.
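
For the API case, one common approach (an assumption here, not something prescribed above) is to put a bounded buffer between the request handler and the slower processing step. The handler then always answers quickly, and a burst either waits in the buffer or is turned away with a "retry later" response. The class and method names below are hypothetical.

```python
import queue


class IngestBuffer:
    """Bounded buffer between a fast API handler and a slower processing step."""

    def __init__(self, max_pending):
        self._queue = queue.Queue(maxsize=max_pending)

    def accept(self, event):
        """Called by the request handler: returns immediately, never blocks."""
        try:
            self._queue.put_nowait(event)
            return True
        except queue.Full:
            return False  # the caller can answer "busy, retry later"

    def drain(self, batch_size):
        """Called by a background worker: take up to batch_size events."""
        batch = []
        while len(batch) < batch_size:
            try:
                batch.append(self._queue.get_nowait())
            except queue.Empty:
                break
        return batch


if __name__ == "__main__":
    buffer = IngestBuffer(max_pending=3)
    print([buffer.accept(i) for i in range(5)])  # [True, True, True, False, False]
    print(buffer.drain(batch_size=2))            # [0, 1]
    print(buffer.accept(99))                     # True: there is room again
```

The buffer size is the tuning knob: a larger buffer absorbs longer bursts at the cost of memory and of data waiting longer before it is processed.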

Recoverability

Data is the foundation of everything, yet we should be prepared to lose some of it, for a variety of reasons. The ability to recover effectively and quickly, and to keep data available over time, is therefore an important challenge for data engineers.
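
The simplest form of recovery is a backup that can be restored. The sketch below uses SQLite’s online backup API from the Python standard library as a stand-in; the `readings` table and the function names are invented, and production systems typically add scheduled snapshots, replication, and regular restore drills.

```python
import os
import sqlite3
import tempfile


def snapshot(live, backup_path):
    """Copy the live database into a backup file using SQLite's backup API."""
    target = sqlite3.connect(backup_path)
    with target:
        live.backup(target)
    target.close()


def restore(backup_path):
    """Rebuild an in-memory working copy from the backup file."""
    source = sqlite3.connect(backup_path)
    restored = sqlite3.connect(":memory:")
    source.backup(restored)
    source.close()
    return restored


if __name__ == "__main__":
    live = sqlite3.connect(":memory:")
    live.execute("CREATE TABLE readings (sensor TEXT, value REAL)")
    live.executemany("INSERT INTO readings VALUES (?, ?)", [("a", 1.0), ("b", 2.0)])
    live.commit()

    backup_file = os.path.join(tempfile.mkdtemp(), "backup.db")
    snapshot(live, backup_file)

    live.execute("DROP TABLE readings")  # simulate accidental data loss
    recovered = restore(backup_file)
    print(recovered.execute("SELECT * FROM readings ORDER BY sensor").fetchall())
```

Anything written after the last snapshot is still lost, which is why the backup frequency has to match how much data loss the users can tolerate.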

Conclusion

All in all, the confusion about who data engineers are and what they are responsible for is understandable. It is an interesting and diverse role that includes coding, setting up and maintaining cloud infrastructure, complex database work, and in some cases statistics and machine learning.

Users of your data infrastructure trust you to provide a system that is reliable, that handles sudden load quickly without losing critical data, and that can recover information in unexpected circumstances. The many challenges make the work interesting and the learning curve steep: the world of data is evolving rapidly, and to stay relevant we have to keep up with the changes and with the technologies we encounter as solutions.

The role also carries a lot of responsibility. Data reliability, for example, is a real challenge. In the mildest of the bad cases, incorrect data means losing a lot of money. In other cases it may have legal consequences, or lead to wrong decisions that cost people their lives. For example, sensors mounted on gas tanks that report leaks in real time can miss a disaster alert if the data is transformed incorrectly along the way.

Published on Medium by Towards Data Science.