Data Engineer

A Data Engineer is responsible for laying the foundations for the acquisition, storage, transformation, and management of data in an organization. The Data Engineer is in charge of developing and maintaining the database architecture and data processing systems. This infrastructure is key to ensuring that high-level data applications, such as data visualization and the deployment of machine learning models, can be developed and run in a seamless, secure, and effective way.

I am a certified DataCamp Data Engineer Associate. As a Data Engineer Associate I can demonstrate that I have the knowledge, skills, and abilities to succeed at the entry level in this role. The competency domains assessed included, but were not limited to:

  • Data Management
  • Programming for Data Engineering
  • Exploratory Analysis


As a Data Engineer I can:

  • work on data mapping, data integration and ingestion, data processing, and data automation,
  • design and build pipelines that run asynchronous data processing jobs triggered from a user interface request,
  • handle multiple GCP, AWS, or Azure services,
  • build, improve, and test code that moves and manipulates data coming from disparate sources, including massive log and event streams, SQL databases, and online, API-based services,
  • design and implement data storage,
  • design and develop data processing,
  • monitor and optimize data storage and data processing.

My Data Engineering Learning Path

With DataCamp, Microsoft Learn, and Coursera I built my skills and experience and validated my knowledge:

Data Engineer (DataCamp) 57 hours (skill track ⇒ certificate)

In this track I grew my language skills as I worked with Shell, SQL, and Scala to create data engineering pipelines, automate common file system tasks, and build a high-performance database. Through hands-on exercises, I added cloud and big data tools such as AWS Boto, PySpark, Spark SQL, and MongoDB to my data engineering toolkit, helping me create and query databases, wrangle data, and configure schedules to run my pipelines. By the end of the track, I had mastered the critical database, scripting, and process skills I need to progress in my career, and I had a firm grasp of Python for data engineering.
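The pipeline work in the track centered on PySpark and Spark SQL. Below is a minimal sketch of that pattern, with file names and column names invented purely for illustration:

```python
# A minimal sketch (hypothetical paths and columns) of a PySpark pipeline:
# read raw data, aggregate it with Spark SQL, write the result back out.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("toy_pipeline").getOrCreate()

# Read a raw CSV file into a Spark DataFrame (path is an assumption).
raw = spark.read.csv("raw_events.csv", header=True, inferSchema=True)

# Register the DataFrame as a temporary view and query it with Spark SQL.
raw.createOrReplaceTempView("events")
daily = spark.sql("""
    SELECT event_date, COUNT(*) AS n_events
    FROM events
    GROUP BY event_date
""")

# Persist the aggregated result as Parquet for downstream consumers.
daily.write.mode("overwrite").parquet("daily_event_counts")
spark.stop()
```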

In this course, I learned about a data engineer’s core responsibilities, how they differ from a data scientist’s, and how data engineers facilitate the flow of data through an organization. Through hands-on exercises I followed Spotflix, a fictional music streaming company, to understand how its data engineers collect, clean, and catalog their data. By the end of the course, I understood what a company’s data engineers do, was ready to have a conversation with a data engineer, and had a solid foundation to start my own data engineering journey.

I learned how to structure and query relational databases using SQL.

I learned how to filter and compare data, use aggregate functions to summarize data, sort and group data, and present data cleanly using tools such as rounding and aliasing.
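A self-contained sketch of those fundamentals, using Python’s built-in sqlite3 module and an invented films table:

```python
# Filtering, aggregation, grouping, sorting, rounding, and aliasing
# demonstrated on a small invented table.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE films (title TEXT, year INTEGER, gross REAL)")
conn.executemany(
    "INSERT INTO films VALUES (?, ?, ?)",
    [("A", 1999, 120.5), ("B", 1999, 80.25), ("C", 2005, 300.0)],
)

rows = conn.execute("""
    SELECT year,
           ROUND(AVG(gross), 1) AS avg_gross   -- aggregate, round, alias
    FROM films
    WHERE year >= 1990                         -- filter and compare
    GROUP BY year                              -- group
    ORDER BY avg_gross DESC                    -- sort
""").fetchall()
print(rows)  # [(2005, 300.0), (1999, 100.4)]
conn.close()
```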

In this course, I learned how to work with more than one table in SQL: using inner, outer, and cross joins; leveraging set theory, including UNION, INTERSECT, and EXCEPT clauses; and creating nested queries.
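A short sketch of joins and set operations, again with sqlite3 and invented tables:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE artists (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE albums  (artist_id INTEGER, title TEXT);
    INSERT INTO artists VALUES (1, 'Miles'), (2, 'Nina');
    INSERT INTO albums  VALUES (1, 'Kind of Blue');
""")

# INNER JOIN keeps only artists that have a matching album...
inner = conn.execute("""
    SELECT a.name, b.title
    FROM artists AS a
    INNER JOIN albums AS b ON b.artist_id = a.id
""").fetchall()

# ...while a LEFT (outer) JOIN keeps every artist, album or not.
outer = conn.execute("""
    SELECT a.name, b.title
    FROM artists AS a
    LEFT JOIN albums AS b ON b.artist_id = a.id
""").fetchall()

# Set operations combine whole result sets.
names = conn.execute("""
    SELECT name FROM artists
    EXCEPT
    SELECT 'Nina'
""").fetchall()
print(inner, outer, names)
```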

I learned how to create tables and specify their relationships, as well as how to enforce data integrity. I also discovered other unique features of database systems, such as constraints.
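A small sketch of those ideas with sqlite3 (the table layout is invented; note that SQLite only enforces foreign keys when the pragma is switched on):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only on request
conn.executescript("""
    CREATE TABLE authors (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL UNIQUE
    );
    CREATE TABLE books (
        isbn      TEXT PRIMARY KEY,
        author_id INTEGER NOT NULL REFERENCES authors(id),
        price     REAL CHECK (price >= 0)
    );
""")

# Violating a constraint raises an error instead of corrupting the data.
try:
    conn.execute("INSERT INTO books VALUES ('978-0', 99, -5.0)")
except sqlite3.IntegrityError as exc:
    print("rejected:", exc)
```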

I learned how to process, store, and organize data in an efficient way: structuring data through normalization and presenting it with views. Finally, I learned how to manage the database, and I practiced all of this on a variety of datasets, from book sales and car rentals to music reviews.
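To illustrate the normalization-plus-view idea, here is a minimal sqlite3 sketch with invented data: facts and descriptions live in separate tables, and a view presents the joined shape without duplicating storage.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Normalized layout: facts in one table, descriptions in another.
    CREATE TABLE genres  (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE reviews (genre_id INTEGER, score INTEGER);
    INSERT INTO genres  VALUES (1, 'jazz'), (2, 'rock');
    INSERT INTO reviews VALUES (1, 9), (1, 7), (2, 6);

    -- The view exposes a denormalized shape to readers.
    CREATE VIEW avg_score_per_genre AS
        SELECT g.name, AVG(r.score) AS avg_score
        FROM reviews r JOIN genres g ON g.id = r.genre_id
        GROUP BY g.name;
""")
print(conn.execute("SELECT * FROM avg_score_per_genre").fetchall())
# e.g. [('jazz', 8.0), ('rock', 6.0)]
```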

Python is a general-purpose programming language that is becoming ever more popular for data science. Companies worldwide are using Python to harvest insights from their data and gain a competitive edge. This course focused on Python specifically for data science. I learned about powerful ways to store and manipulate data, and helpful data science tools to begin conducting my own analyses.

In this course I discovered how dictionaries offer an alternative to Python lists, and why the pandas DataFrame is the most popular way of working with tabular data. In the second chapter of this course, I found out how to create and manipulate datasets, and how to access them using these structures.
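A tiny sketch of both structures (the data is invented):

```python
import pandas as pd

# A dictionary gives direct lookup by key, unlike scanning a list.
capitals = {"France": "Paris", "Norway": "Oslo"}
print(capitals["Norway"])  # Oslo

# A pandas DataFrame is the standard structure for tabular data;
# it can be built directly from a dict of columns.
df = pd.DataFrame({
    "country": ["France", "Norway"],
    "capital": ["Paris", "Oslo"],
    "population_m": [68.0, 5.5],
})
print(df[df["population_m"] > 10])  # filter rows on a column condition
```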

As a data scientist, I needed to clean data, wrangle and munge it, visualize it, build predictive models, and interpret these models. I also needed to know how to get data into Python. In this course, I learned the many ways to import data into Python: from flat files such as .txt and .csv; from files native to other software such as Excel spreadsheets, Stata, SAS, and MATLAB files; and from relational databases such as SQLite and PostgreSQL.
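A hedged sketch of those import routes; the file names, table name, and sheet index are placeholders, so the snippet assumes such files exist locally:

```python
import sqlite3
import pandas as pd

# Flat files: pandas reads .csv and .txt directly into DataFrames.
df_csv = pd.read_csv("survey.csv")               # path is an assumption

# Files native to other software, e.g. Excel and Stata.
df_xls = pd.read_excel("budget.xlsx", sheet_name=0)
df_dta = pd.read_stata("study.dta")

# Relational databases: run a query and get a DataFrame back.
conn = sqlite3.connect("app.db")
df_sql = pd.read_sql_query("SELECT * FROM users", conn)
```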

In this course, I extended my knowledge base by learning to import data from the web and by pulling data from Application Programming Interfaces (APIs), such as the Twitter streaming API, which allowed me to stream real-time tweets.
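The basic pattern looks like the sketch below; the URLs are illustrative placeholders, not real endpoints:

```python
from io import StringIO

import pandas as pd
import requests

# Pulling JSON from a REST API (URL and parameters are illustrative).
resp = requests.get("https://api.example.com/v1/tracks",
                    params={"q": "jazz"}, timeout=10)
resp.raise_for_status()
data = resp.json()  # parsed straight into Python dicts and lists

# Downloading a flat file from the web and loading it with pandas.
csv_text = requests.get("https://example.com/data.csv", timeout=10).text
df = pd.read_csv(StringIO(csv_text))
```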

In this course I learned about the many advantages of the cloud, including ease of remote collaboration, freedom from hardware limitations, and reliable disaster recovery. I also discovered the range of tools provided by major cloud providers such as Amazon Web Services (AWS), Microsoft Azure, and Google Cloud. I learned to explain confidently how cloud tools can increase productivity and save money, and to ask the right questions about how to optimize my use of cloud tools.

Data cleaning is an essential task in data science. Without properly cleaned data, the results of any data analysis or machine learning model could be inaccurate. In this course, I learned how to identify, diagnose, and treat a variety of data cleaning problems in Python, ranging from simple to advanced. I dealt with improper data types, checked that my data was in the correct range, handled missing data, performed record linkage, and more.
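A self-contained sketch of three of those fixes on an invented table:

```python
import pandas as pd

df = pd.DataFrame({
    "duration": ["10", "25", "999"],   # stored as strings by mistake
    "rating":   [4.5, None, 3.0],
})

# Fix an improper data type.
df["duration"] = df["duration"].astype(int)

# Range check: flag values outside a plausible 0-120 window.
out_of_range = df[~df["duration"].between(0, 120)]
print(out_of_range)

# Treat missing data: here, fill with the column mean.
df["rating"] = df["rating"].fillna(df["rating"].mean())
```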

In this course, I learned how to use Python’s built-in data structures, functions, and modules to write cleaner, faster, and more efficient code. I explored how to time and profile code in order to find bottlenecks. Then, I practiced eliminating these bottlenecks, and other bad design patterns, using Python’s Standard Library, NumPy, and pandas. After completing this course, I had the necessary tools to start writing efficient Python code.
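A minimal, self-contained illustration of timing a bottleneck and removing it with vectorized NumPy:

```python
import timeit

import numpy as np

values = list(range(100_000))

def squares_loop():
    # Pure-Python loop: one bytecode round-trip per element.
    return [v ** 2 for v in values]

arr = np.array(values)
def squares_numpy():
    # Vectorized: the same work done in compiled code.
    return arr ** 2

print(timeit.timeit(squares_loop, number=100))
print(timeit.timeit(squares_numpy, number=100))  # typically far faster
```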

In this course I learned how to build pipelines to import data kept in common storage formats. I used pandas, a major Python library for analytics, to get data from a variety of sources: spreadsheets of survey responses, a database of public service requests, and an API for a popular review site. I also learned how to fine-tune imports to get only what I need and to address issues like incorrect data types.
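A sketch of that fine-tuning; the file and column names are assumptions:

```python
import pandas as pd

# Import only the needed columns and fix types at read time.
df = pd.read_csv(
    "service_requests.csv",
    usecols=["created_date", "complaint_type", "borough"],
    parse_dates=["created_date"],
    dtype={"complaint_type": "category"},
)

# For files too large for memory, stream them in chunks.
total = 0
for chunk in pd.read_csv("service_requests.csv", chunksize=50_000):
    total += len(chunk)
```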

In this course I learned how to choose the best visualization for my dataset, and how to interpret common plot types like histograms, scatter plots, line plots and bar plots. I also learned about best practices for using colors and shapes in my plots, and how to avoid common pitfalls.
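The four plot types mentioned above, sketched with matplotlib on invented data:

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.arange(10)
y = np.random.default_rng(0).normal(size=10)

fig, axes = plt.subplots(2, 2, figsize=(8, 6))
axes[0, 0].hist(y)            # histogram: distribution of one variable
axes[0, 1].scatter(x, y)      # scatter: relationship between two variables
axes[1, 0].plot(x, y)         # line: trend over an ordered axis
axes[1, 1].bar(x, np.abs(y))  # bar: comparison across categories
plt.tight_layout()
plt.show()
```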

In this course I discovered the importance of version control when working on data science projects and explored how to use Git to track files, compare differences, modify and save files, undo changes, and enable collaborative development through the use of branches. I was introduced to the structure of a repository, learned how to create new repositories and clone existing ones, and saw how Git stores data. I also gained the skills to handle conflicting files.

Data scientists can benefit hugely from concepts in the field of software engineering, which allow them to more easily reuse their code and share it with collaborators. In this course, I learned all about the important ideas of modularity, documentation, and automated testing, and I saw how they can help solve data science problems more quickly and in a way that makes future me happy.
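All three ideas fit in a few lines; this is an invented example, with the test written in the assert style that pytest picks up:

```python
def mean(values):
    """Return the arithmetic mean of a non-empty sequence of numbers."""
    if not values:
        raise ValueError("values must be non-empty")
    return sum(values) / len(values)  # modularity: reusable everywhere

def test_mean():
    # Run with `pytest` to execute this check automatically.
    assert mean([1, 2, 3]) == 2
```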

Microsoft Azure Data Engineering Associate (Microsoft, Coursera) 113 hours (course ⇒ certificate)

This Professional Certificate is intended for data engineers and developers who want to demonstrate their expertise in designing and implementing data solutions that use Microsoft Azure data services.

It helped me develop exactly that expertise: I learned how to integrate, transform, and consolidate data from various structured and unstructured data systems into structures that are suitable for building analytics solutions on Azure.
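As one small, hedged sketch of touching an Azure data service from Python, here is a blob upload using the azure-storage-blob package; the connection string, container, and file names are placeholders, not values from the certificate:

```python
from azure.storage.blob import BlobServiceClient

# Connect to a storage account (connection string is a placeholder).
service = BlobServiceClient.from_connection_string("<connection-string>")
container = service.get_container_client("raw-data")

# Land a local file in Blob Storage, a common first step before
# transforming it with services such as Azure Data Factory or Synapse.
with open("events.csv", "rb") as fh:
    container.upload_blob(name="events.csv", data=fh, overwrite=True)
```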

⇒ Verify at: Coursera


Some articles about Data Engineering: