A data engineer develops systems for collecting, managing, and converting raw data into information that can be interpreted in a variety of ways by data scientists and business analysts. Organizations use this data to evaluate and optimize their performance, and making it accessible is the ultimate goal. This makes data engineering a hot career prospect. We have put together a comprehensive list of Data Engineer Interview Questions and Answers for beginner, intermediate, and experienced data engineers. These important questions are categorized for quick browsing before the interview or for use as a guide on various data engineering topics like big data, Hive, Hadoop, Python, SQL, databases, etc. These interview questions and answers will boost your knowledge in the field, improve your core interview skills, and help you perform better in interviews related to data engineering. Also, to get your concepts up to speed, you can start your training with our big data and Hadoop training course.
This may seem like a pretty basic question, but regardless of your skill level, this is one of the most common questions that can come up during your interview. So, what is it? Briefly, Data Engineering is a term used in big data. It is the process of transforming the raw entity of data (data generated from various sources) into useful information that can be used for various purposes.
Data modelling is the scientific process of converting and transforming complex software data systems by breaking them up into simple diagrams that are easy to understand, thus making the system independent of any pre-requisites. You can describe any prior experience you have with data modelling in the form of a few scenarios.
Companies can ask you questions about design schemas in order to test your knowledge regarding the fundamentals of data engineering. Data Modelling consists of mainly two types of schemas:
The difference between structured and unstructured data is as follows-
Parameter | Structured Data | Unstructured Data |
Storage | DBMS | File structures are unmanaged |
Standard | ODBC, ADO.NET, and SQL | XML, SMTP, CSV, and SMS |
Integration Tool | ETL (Extract, Transform, Load) | Batch processing or manual data entry |
Scaling | Schema scaling is difficult | Schema scaling is very easy |
Version management | Versioning is possible over tuples, rows, and tables | Versioning is only possible on the data as a whole |
Example | An ordered text dataset file | Images, video files, audio files, etc. |
In today’s world, the majority of big applications generate big data that requires vast storage space and a large amount of processing power; Hadoop plays a significant role in providing this capability to the database world.
This is one of the most frequently asked data engineer interview questions for freshers in recent times. Here are the components of a Hadoop application.
A Hadoop application consists of -
NameNode is the master node in the Hadoop HDFS architecture. It stores the metadata of all the files in HDFS and keeps track of the various files across all clusters. The NameNode does not store the actual data, only the metadata of HDFS; the actual data gets stored in the DataNodes.
Don't be surprised if this question pops up as one of the top Python interview questions for big data engineer in your next interview.
Hadoop streaming is one of the widely used utilities that comes with the Hadoop distribution. This utility is provided for allowing the user to create and run Map/Reduce jobs with the help of various programming languages like Ruby, Perl, Python, C++, etc. which can then be submitted to a specific cluster for usage.
Some of the important features of Hadoop are as below:
Expect to come across this, one of the most important data engineer interview questions for experienced professionals in data engineering, in your next interviews.
A block is the smallest unit of data allocated to a file; blocks are created automatically by the Hadoop system to store data across a different set of nodes in a distributed system. Large files are automatically sliced into small chunks called blocks by Hadoop.
The block scanner, as its name suggests, is used to verify whether the small chunks of files known as blocks that are created by Hadoop have been successfully stored in the DataNode or not. It helps to detect the corrupt blocks present in a DataNode.
Following are the steps followed by the block scanner when it detects a corrupted DataNode block-
This whole process helps HDFS in maintaining the integrity of the data during read operation performed by a client.
A must-know for anyone looking for top data engineer interview questions, this is one of the frequently asked big data engineer interview questions.
Below are the steps to achieve security in Hadoop:
NameNode communicates and gets information from DataNode via messages or signals.
There are two types of messages/signals that are used for this communication across the channel: the Heartbeat and the Block Report.
The default ports for Task Tracker, Job Tracker, and NameNode in Hadoop are as below: Task Tracker uses port 50060, Job Tracker uses port 50030, and NameNode uses port 50070.
This question is asked by interviewers to check your understanding of the role of a data engineer.
The difference between NAS and DAS is as follows:
NAS | DAS |
NAS stands for Network Attached Storage | DAS stands for Direct Attached Storage |
Storage capacity of NAS is between 10^9 and 10^12 bytes. | Storage capacity of DAS is around 10^9 bytes. |
In NAS, Storage is distributed over distinct servers on a network | In DAS, storage is attached to the node where computation process is taking place. |
It has moderate storage management cost | It has high storage management cost |
Data transmission takes place using Ethernet or TCP/IP. | Data transmission takes place using IDE/ SCSI |
Below are the various fields and languages used by a data engineer:
In Hadoop, Rack awareness is the concept of choosing the DataNodes which are closer according to the rack information. By default, Hadoop assumes that all the nodes belong to the same rack.
To reduce network traffic while reading/writing HDFS files, the NameNode serves read/write requests from DataNodes that are on the same or a nearby rack. To achieve this, the rack IDs of each DataNode are maintained by the HDFS NameNode. This concept in HDFS is known as Rack Awareness.
When the NameNode is down, the entire cluster is down, so the cluster cannot be accessed and all the services running on that cluster are also down. In this scenario, any user who tries to submit a new job will get an error and the job will fail, and all existing jobs that are running will also fail.
So, briefly, we can say that when the NameNode goes down, all new as well as existing jobs will fail because all services are down. The user has to wait for the NameNode to restart and can run the jobs once the NameNode is back up.
The four Vs of big data describe four dimensions of big data: Volume, Velocity, Variety, and Veracity.
The various XML configuration files present in Hadoop are core-site.xml, hdfs-site.xml, mapred-site.xml, and yarn-site.xml.
The main methods of the reducer are setup(), reduce(), and cleanup().
One of the most frequently posed data engineer scenario based questions, be ready for this conceptual question.
FIFO also known as First In First Out is the simple job scheduling algorithm in Hadoop which implies that the tasks or processes that come first will be served first. In Hadoop, FIFO is the default scheduler. All the tasks or processes are placed in a queue and they get their turn to get executed according to their order of submission. There is one major disadvantage of this type of scheduling which is that the higher priority tasks have to wait for their turn which can impact the process.
Hadoop can be operated in three different modes: standalone (local) mode, pseudo-distributed mode, and fully distributed mode.
This, along with other data engineer questions for freshers, is a regular feature in data engineer interviews, be ready to tackle it with the approach mentioned below.
In Hadoop, replication factor depicts the number of times the framework replicates or duplicates the Data blocks in a system. The default replication factor in Hadoop is 3 which can be manipulated as per the system requirements. The main advantage of the replication process is to ensure data availability.
We can configure the replication factor in hdfs-site.xml file which can be less than or more than 3 according to the requirements.
In Hadoop, the primary phases of the reducer are shuffle, sort, and reduce.
The distance between two nodes is the sum of their distances to their closest common ancestor in the network topology. The getDistance() method can be used to calculate the distance between two nodes.
In Hadoop, Context object is used along with the Mapper class so that it can interact with the other remaining parts of the system. Using the Context object, all the jobs and the system configuration details can be easily obtained in its constructor.
Information can be easily passed or sent to the methods like cleanup(), setup() and map() using the Context object. During map operations, vital information can be made available using the Context object.
In Apache Hadoop, Safe mode is a mode that is used for the purpose of maintenance. It acts as read-only mode for NameNode in order to avoid any modifications to the file systems. During Safe mode in HDFS, Data blocks can’t be replicated or deleted. Collection of data and statistics from all the DataNodes happen during this time.
The available components of Hive Data Model are as below:
Hive supports below-given complex data types:
In Hive, SerDe stands for Serialization and Deserialization. SerDe is a built-in interface in the Hive library that instructs Hive on how a record (row) should be processed.
The deserializer takes the binary representation of a record and translates it into a Java object that Hive can understand. The serializer then takes the Java object that Hive is working with and converts it into a format that HDFS can process and store.
The Table creation functions present in Hive are as follows:
The objects created by create statement in MySQL are listed below:
A staple in data engineer technical interview questions and answers, be prepared to answer this one using your hands-on experience.
In Hive, .hiverc acts as the initialization file. Whenever you open the CLI (Command Line Interface) to write Hive code, .hiverc is the first file that gets loaded. It contains all the parameters that you have initially set.
For example, you can set column headers that you want to be visible in the query results, the addition of any jar files, etc. This file is loaded from the hive conf directory.
Metastore acts as the central repository for Hive metadata. It is used for storing the metadata of Hive tables i.e., schemas and locations.
Metadata is first stored in metastore which is later stored in a relational database (RDBMS) whenever required.
Metastore consists of 3 types of modes for deployment. These are given below.
In Hive, multiple tables can be created for a single data file using the same HDFS directory. As we know already that metastore acts as the central repository for Hive metadata and it stores metadata like schemas and locations.
The data itself remains in the same file, so it becomes very easy to retrieve different results for the same data based on the schema.
In Hive, there are special tables in which the values of a column appear in a repeating manner (skew); these tables are called skewed tables. While creating a table in Hive, we can specify it as SKEWED. All the skewed values in the table are written into separate files, and the remaining values are stored in another file.
While writing queries, skewed tables help to provide better performance. Syntax to define a particular table as ‘skewed’ during its creation is as written below using an example.
CREATE TABLE TableName (column1 STRING, column2 STRING) SKEWED BY (column1) ON ('value');
In MYSQL, we can see the data structure with the help of DESCRIBE command.
The syntax to use this command is as follows.
DESCRIBE table_name;
We can see the list of all tables in MYSQL using SHOW command.
The syntax to use this command is as follows.
SHOW TABLES;
We can perform various operations on strings as well as the substrings present in a table. In order to search for a specific string in a table column, we can use REGEX operator for the same.
Following are some of the ways in which big data and data analytics can positively impact a company’s business.
Below are the steps that need to be followed in order to deploy a big data solution.
FSCK stands for File System Consistency Check. Briefly, we can define FSCK as a command that is used in order to check any inconsistencies or any problems in HDFS file system or at the HDFS level.
Syntax of using FSCK command is as below.
hadoop fsck [GENERIC OPTIONS] <path> [-delete | -move | -openforwrite] [-files [-blocks [-locations | -racks]]]
YARN is an abbreviation of Yet Another Resource Negotiator. In Hadoop, it is considered one of the main components. YARN helps in processing and running data stored in HDFS for stream processing, graph processing, batch processing, and interactive processing. So, briefly, we can say that YARN helps to run various types of distributed applications.
Using YARN, the efficiency of the system can be increased as data that is stored in HDFS is processed and run by various types of processing engines as depicted above.
It is also known for optimum utilization of all available resources that results in easy processing of a high volume of data.
A staple in senior data engineer interview questions and answers, be prepared to answer this one using your hands-on experience.
In Hadoop, HDFS, abbreviated from Hadoop Distributed File System, is considered the standard storage mechanism. It is built with the help of commodity hardware. Hadoop does not require a costly server with high processing power and large storage; we can use inexpensive systems with an average processor and RAM. These systems are called commodity hardware.
These are affordable, easy to obtain, and compatible with various operating systems like Linux, Windows, and MS-DOS without requiring any special devices or equipment. Another benefit of using commodity hardware is its scalability.
The various functions of Secondary NameNode are as follows.
A Combiner, also known as a Mini-Reducer, acts as an optional step between Map and Reduce. Briefly, it takes the output from the Map function, summarizes the records that share the same key, and then passes the summarized records as input to the Reducer.
When we run a MapReduce job on a large dataset, a large chunk of data is generated by the Mapper, which, when passed to the Reducer for further processing, can cause congestion in the network. To deal with this kind of congestion, the Hadoop framework uses the Combiner as an intermediate step between Mapper and Reducer to reduce network congestion.
In Hadoop, when we are dealing with Big Data Systems, then the size of data is huge. Therefore, it is not a good practice to move this large amount of data across the network otherwise it may impact the system output and also causes network congestion.
In order to get rid of these problems, Hadoop uses the concept of Data Locality. Briefly, it is the process of moving the computation towards the data rather than moving a huge amount of data towards the computation. In this way, data always remains local to its storage location. So, when a user runs a MapReduce job, the NameNode sends the MapReduce code to the DataNodes that contain the data related to that job.
Balancer is a utility provided by HDFS. As we know that, DataNodes stores the actual data related to any job or process. Datasets are divided into blocks and these blocks are stored across the DataNodes in Hadoop cluster. Some of these nodes are underutilized and some are overutilized by the storage of blocks, so a balance needs to be maintained.
Here comes the use of balancer which analyses the block placement across various nodes and moves blocks from overutilized to underutilized nodes in order to maintain balance of data across the DataNodes until the cluster is deemed to be balanced.
In Hadoop, the distributed cache is a utility provided by the MapReduce framework. Briefly, it can cache files such as jar files, archives, and text files when they are needed by an application.
When a MapReduce job is running, this utility caches the read-only files and makes them available to all the DataNodes. Each DataNode gets a local copy of the file, so all cached files can be accessed on every DataNode. These files remain on the DataNodes while the job is running and are deleted once the job is completed.
The default size of Distributed cache is 10 GB which can be adjusted according to the requirement using local.cache.size.
In Hive, there are various types of SerDe implementations available. There is also a provision to create your own custom SerDe implementations. Few of the popular implementations are listed below.
In Python, we can pass a variable number of arguments to a function when we are unsure how many arguments need to be passed. These arguments are passed using special symbols: *args for a variable number of non-keyword (positional) arguments and **kwargs for a variable number of keyword arguments.
Function flexibility is achieved by passing these two types of special symbols, as the sketch below shows.
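As a small illustration, here is a minimal sketch; the function name describe_job and its arguments are made up for the example.

def describe_job(*args, **kwargs):
    # args collects the extra positional arguments as a tuple
    for value in args:
        print("positional:", value)
    # kwargs collects the extra keyword arguments as a dictionary
    for key, value in kwargs.items():
        print("keyword:", key, "=", value)

describe_job("etl", "daily", owner="data-team", retries=3)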
The differences between Data warehouse and Database are given below.
Parameter | Data warehouse | Database |
Definition | It is a system that collects and stores information from multiple data sources within an organization | It is an organised collection of logical data that is easy to search, manipulate and analyse. |
Purpose and Usage | It is used for the purpose of analysis of your business | It is used for recording the data and performing various fundamental operations for your business. |
Data Availability | When required, data is refreshed and captured from various data sources | Real time data is always available |
Type of data stored | It contains only summarized data | It contains detailed data |
Usage of Queries | Complex queries are used | Simple queries are used |
Tables and Joins | In Data warehouse, Tables and Joins are simple | In Database, Tables and Joins are complex |
The differences between OLAP and OLTP are given below.
OLAP | OLTP |
It is used for managing informational data | It is used for managing operational data |
The size of database is 100 GB-TB | The size of database is 100 MB-GB |
It contains large volume of data | The volume of stored data is not that much large |
It has mainly one access mode, that is, read mode | It has both read and write access modes |
It is partially normalized | It is completely normalized |
Its processing speed depends on a lot of factors, like the complexity of queries, the number of files it contains, etc. | It has very high processing speed |
It is market oriented and is mainly used by analysts, managers and executives | It is customer oriented and is mainly used by clerks, clients and IT professionals |
The differences between the NoSQL and SQL database are as below.
Parameter | NoSQL database | SQL database |
History | It was developed in the late 2000s with a focus on allowing rapid changes to applications and scalability | It was developed in the 1970s with a focus on reducing the problem of data duplication |
Data Storage Model | Tables with rows and dynamic columns are used | Tables with fixed rows and columns are used |
Schemas | Schemas are flexible | Schemas are rigid |
Scaling | Horizontal scaling is possible | Vertical scaling is possible |
Joins usage | Joins are not required in NoSQL | Joins are typically required in SQL |
Examples | MongoDB and CouchDB | MySQL, Oracle, Microsoft SQL Server, and PostgreSQL |
In modern applications that have complex and constantly changing data sets, NoSQL seems to be a better option than a traditional database, since such applications need a flexible data model that does not have to be defined up front.
NoSQL provides various agile features that help companies go to market faster and make updates faster. It also helps to store real-time data.
When dealing with an increased data processing load, it is generally a better approach to scale out rather than scale up with bigger servers. NoSQL is a better option here as it is cost-effective and can deal with huge volumes of data. Although relational databases provide better connectivity with analytical tools, NoSQL still offers a lot of features compared to a traditional database.
In Python, both list and tuple are classes of data structures. Differences between list and tuple are as follows.
List | Tuple |
Lists are mutable, i.e., they can be modified | Tuples are immutable, i.e., they can’t be modified |
Memory consumption of List is more | Memory consumption of Tuple is less as compared to List |
List is more prone to errors and unexpected changes | Tuple is not prone to such errors and unexpected changes |
It contains a lot of built-in methods | It doesn’t contain many built-in methods |
Operations like insertion and deletion are performed better using list | Tuple is mainly used for accessing the elements |
List has dynamic characteristics so it is slower compared to tuple | Tuple has static characteristics so it is faster |
Syntax: list_data1 = ['list', 'can', 'be', 'modified', 'easily'] | Syntax: tuple_data1 = ('tuple', 'can’t', 'be', 'modified', 'ever') |
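A short sketch of the mutability difference described above (the variable names follow the syntax examples in the table):

list_data1 = ['list', 'can', 'be', 'modified', 'easily']
list_data1[0] = 'LIST'        # allowed: lists are mutable
tuple_data1 = ('tuple', 'cannot', 'be', 'modified', 'ever')
# tuple_data1[0] = 'TUPLE'    # would raise TypeError: tuples are immutable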
We can come across situations in which a table contains multiple duplicate entries; fetching all of them while retrieving records makes no sense and only adds redundancy. We need to fetch only the unique entries.
To achieve this, SQL provides the DISTINCT keyword, which we can use with the SELECT statement to eliminate duplicate entries and fetch only unique ones.
The syntax to use this keyword to eliminate duplicate data is as below:
SELECT DISTINCT column1, column2, column3...columnM FROM table_name1 WHERE [conditions]
We can also use the UNIQUE constraint to handle duplicate data. The UNIQUE constraint ensures that all the values present in a specific column are different.
COSHH stands for Classification and Optimization-based Scheduling for Heterogeneous Hadoop systems.
In a Hadoop system, many tasks are multiplexed and executed in a common data centre. This leads to the Hadoop cluster being shared among many users, which increases system heterogeneity, an issue that the default Hadoop schedulers do not give much importance to. COSHH was designed and implemented to rectify this by providing scheduling at both the cluster and application levels, which improves job completion time.
This question mainly focuses on knowing how you can actually deal with unexpected problems in high pressure situations.
Unexpected problems are inevitable, and many situations arise in which you encounter them while doing your daily routine jobs or tasks. The same is the case with data maintenance.
Data maintenance can be considered one of the daily tasks that need to be monitored properly to make sure all the built-in tasks and corresponding scripts are executing as expected. For example, to prevent corrupt indexes from being added to the database, we can create maintenance tasks that catch them before they cause any serious damage.
Advantages and disadvantages of cloud computing are as follows.
Advantages:
Disadvantages:
This question mainly focuses on knowing what problems you have faced while working as a data engineer in your prior experience. Some of the most common problems can be described here as an answer.
In the modern world, data has become the new currency. Both the data engineer and data scientist roles revolve around data, but there are some differences in their duties, as mentioned below.
Data Engineer | Data Scientist |
This role mainly focuses on the collection and preparation of data. It focuses on designing and implementing pipelines that manipulate and transform unstructured data into the required format that data scientists can use for analysis. | This role mainly focuses on using that prepared data and extracting patterns from it by applying analytical tools, mathematics, and statistical knowledge, and providing deep insights that may positively impact the business. |
Considering the importance of data, it is the duty of the data engineer to keep the data safe and secure and to take data backups to avoid any loss. | After performing data analysis, it is the job of the data scientist to convey the analysis results to the stakeholders, so good communication skills are a must for them. |
Big data and database management skills are a must for a data engineer | Machine learning is a must-have skill for a data scientist |
NFS is Network File System and HDFS is Hadoop Distributed File System. The various differences between the two are as follows.
NFS | HDFS |
Only a small amount of data can be stored and processed with NFS | A large amount of data, or big data, can be stored and processed with HDFS. |
NFS stores data on a single dedicated machine or on a disk of a dedicated network machine. These data files can be accessed by clients over the network | HDFS stores data in a distributed manner; in other words, data is stored on many dedicated machines or network computers |
NFS is not fault tolerant, and data can be lost if some failure occurs. This data can’t be recovered later. | HDFS is fault tolerant, and data can be easily recovered in case of any node failure. |
There is no data redundancy in NFS as all the data is stored on a single dedicated machine. | There is data redundancy here because the same data blocks are replicated across multiple dedicated machines. |
Feature selection is the process of identifying and selecting the most relevant features that can be input to the machine learning algorithms for the purpose of model creation.
Feature selection techniques are used to discard redundant or unrelated features from the input to machine learning models, decreasing the number of input variables and narrowing the features down to only the relevant ones. A few advantages of using these feature selection techniques are mentioned below.
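As an illustration only, one common way to apply feature selection in Python is scikit-learn's SelectKBest; the dataset and the choice of k below are assumptions made for this sketch.

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)                   # 4 input features
selector = SelectKBest(score_func=f_classif, k=2)   # keep only the 2 most relevant features
X_reduced = selector.fit_transform(X, y)
print(X.shape, "->", X_reduced.shape)               # (150, 4) -> (150, 2)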
There are a few ways in which we can handle missing values in big data. These are as follows.
Other than the above-mentioned techniques, we can also use the K-NN algorithm, the Random Forest algorithm, the Naive Bayes algorithm, and Last Observation Carried Forward (LOCF) methods to handle missing values in big data.
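A minimal pandas sketch of two of these options, dropping rows and imputing with the column mean; the column names and values are hypothetical.

import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 40], "salary": [50000, 60000, np.nan]})
dropped = df.dropna()                           # drop rows that contain missing values
filled = df.fillna(df.mean(numeric_only=True))  # impute missing values with the column mean
print(filled)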
Outliers are the data records which are different from the normal records in some of their characteristics. It is very important to first decide the characteristics of the normal records in order to detect the outliers. These records when used in algorithms or analytical systems can provide abnormal results which may impact the analysis process. So, it is very important to detect the outliers to avoid such abnormalities.
We can detect outliers by looking directly at tables or graphs. As an example, let’s suppose there is a table containing the name and age of a few people, and one of the rows contains an age of 500. We can easily see that this is an invalid value, since an age of 40, 50, or 55 is plausible but 500 is not; we can guess at the real age but can’t be sure of the exact value. This kind of detection is easy when dealing with a table with a limited number of records, but if the table contains thousands of records, it becomes practically impossible to detect outliers by inspection.
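For larger tables, a simple programmatic check such as the interquartile range (IQR) rule can flag candidate outliers; a small sketch using the age example above (values are made up).

import pandas as pd

ages = pd.Series([40, 50, 55, 45, 48, 500])    # 500 is clearly suspicious
q1, q3 = ages.quantile(0.25), ages.quantile(0.75)
iqr = q3 - q1
outliers = ages[(ages < q1 - 1.5 * iqr) | (ages > q3 + 1.5 * iqr)]
print(outliers)                                 # flags the value 500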
The difference between the K-Nearest Neighbour and K-Means methods are as below.
KNN | K-means |
KNN is a supervised learning algorithm that can be used for classification or regression. KNN classifies a point based on its K nearest neighbours so that the category of each point can be easily determined | K-means is an unsupervised learning algorithm used for clustering. You select K clusters and then assign each data point to one of those K clusters |
The performance of KNN is better if all the data is on the same scale | This does not hold true for K-means |
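A side-by-side sketch with scikit-learn, assuming a tiny made-up dataset, to show the supervised vs unsupervised difference.

from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans

X = [[1, 1], [1, 2], [8, 8], [9, 8]]
y = [0, 0, 1, 1]                                   # labels available: supervised KNN
knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
print(knn.predict([[8, 9]]))                       # predicts class 1

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)   # no labels: unsupervised
print(kmeans.labels_)                              # cluster assignment for each point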
Logistic regression acts as a predictive model that is used to analyse large datasets and determine a binary output, given an input variable. The binary output can take only a limited number of values, such as 0/1, true/false, or yes/no.
Logistic regression makes use of a sigmoid function to determine the possible outcomes and their corresponding probabilities of occurrence, and maps them onto a graph. An acceptance threshold is set to determine whether a particular instance belongs to a class: if the predicted probability is above the threshold, the instance belongs to that class; otherwise it does not. There are three types of logistic regression: binary, multinomial, and ordinal.
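A minimal binary logistic regression sketch with scikit-learn; the toy data and the 0.5 acceptance threshold are assumptions for the example.

from sklearn.linear_model import LogisticRegression

X = [[1], [2], [3], [10], [11], [12]]            # a single input variable
y = [0, 0, 0, 1, 1, 1]                           # binary outcome
model = LogisticRegression().fit(X, y)
probability = model.predict_proba([[8]])[0][1]   # probability of class 1 via the sigmoid
print(probability, probability > 0.5)            # compare against the acceptance threshold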
A/B testing, also known as split testing, is a randomized statistical experiment performed on two variants (A and B) of a webpage or application by showing the variants to sets of end users and analysing which of the two creates a larger impact, or which variant proves more effective and beneficial to the end users. A/B testing has a lot of benefits, which are as follows.
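One common way to judge the result of such a test on conversion counts is a chi-square test of independence; a hedged sketch with made-up numbers.

from scipy.stats import chi2_contingency

# rows: variant A, variant B; columns: converted, did not convert (hypothetical counts)
observed = [[120, 880],
            [150, 850]]
chi2, p_value, dof, expected = chi2_contingency(observed)
print(p_value)    # a small p-value suggests the two variants really differ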
Collaborative filtering is a technique that makes use of various algorithms to provide personalized recommendations to users. It is also known as social filtering. Some popular websites that use this kind of filtering are iTunes, Amazon, Flipkart, and Netflix.
In collaborative filtering, a user receives personal recommendations compiled from the common interests and preferences of other users, with the help of prediction algorithms. Take two users, A and B: suppose user A visits Amazon and buys items 1 and 2; when user B buys item 1, item 2 will be recommended to user B based on this predictive analysis.
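A very small sketch of the idea behind user-based collaborative filtering, computing cosine similarity between hypothetical rating vectors for users A and B.

import numpy as np

# rows: users A and B; columns: ratings for items 1-3 (0 = not rated); values are made up
ratings = np.array([[5.0, 4.0, 0.0],
                    [5.0, 0.0, 0.0]])
a, b = ratings
similarity = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(similarity)   # high similarity: items liked by A become candidates to recommend to B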
The “is” operator is used for reference equality: it checks whether two references or variables point to the same object, and returns true or false accordingly.
The “==” operator is used for value equality: it checks whether two variables have the same value, and returns true or false accordingly.
Consider an example with two lists X and Y, and a third name Z bound to Y.
X = [1,2,3,4,5]
Y = [1,2,3,4,5]
Z = Y
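Continuing that example, the two operators behave differently because X and Y hold equal values but are separate objects, while Z is just another name for Y.

print(X == Y)   # True: the values are equal
print(X is Y)   # False: X and Y are two different objects in memory
print(Z is Y)   # True: Z and Y refer to the same object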
Python memory manager does the task of managing memory in Python. All the data structures and objects in Python are stored in private heap. It is the duty of Python memory manager only to manage this private heap. Developers can’t access this private heap space. This private heap space can be allocated to objects by memory manager.
Python memory manager contains object specific allocators to allocate the space to specific objects. Along with that it also has raw memory allocators to make sure that space is allocated to the private heap.
Python also provides a garbage collector so that developers don’t need to do garbage collection manually. The main job of this collector is to clear out unused space and make it available for new objects in the private heap.
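The built-in gc module exposes this collector if you ever need to interact with it directly; a small sketch.

import gc

print(gc.isenabled())        # the cyclic garbage collector is enabled by default
unreachable = gc.collect()   # force a collection run
print(unreachable)           # number of unreachable objects found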
Decorators can be considered one of the most important and powerful tools present in Python. With this tool we can temporarily modify the behaviour of a function or a class.
A decorator wraps a function or a class with another function to modify the behaviour of the wrapped function or class without making any permanent changes to its source code.
In Python, functions are first-class objects, so they can easily be used or passed as arguments. In a decorator, a function, acting as a first-class object, is passed as an argument to another function and then called inside the wrapper function.
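A minimal decorator sketch; the names log_calls and greet are made up for illustration.

def log_calls(func):
    # the wrapper changes behaviour without touching func's source code
    def wrapper(*args, **kwargs):
        print("calling", func.__name__)
        result = func(*args, **kwargs)
        print("finished", func.__name__)
        return result
    return wrapper

@log_calls
def greet(name):
    print("Hello,", name)

greet("data engineer")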
append(): In Python, when we pass an argument to append(), it is added as a single entity to the list. In other words, when we append a list to another list, the whole list is added as a single object at the end of the other list, so the length of the list increases by only 1. append() has a fixed time complexity of O(1).
Example: Let’s take an example of two lists as shown below.
list1 = ["Alpha", "Beta", "Gamma"]
list2 = ["Delta", "Eta", "Theta"]
list1.append(list2)
list1 will now become: ["Alpha", "Beta", "Gamma", ["Delta", "Eta", "Theta"]]
The length of list1 will now become 4 after addition of second list as a single entity.
extend(): In Python, when we pass an argument to extend(), all the elements contained in that argument are added to the list individually; in other words, the argument is iterated over. So the length of the list increases by the number of elements added from the other list. extend() has a time complexity of O(n), where n is the number of elements in the argument passed to extend().
Example: Let’s take an example of two lists as shown below.
list1 = ["Alpha", "Beta", "Gamma"]
list2 = ["Delta", "Eta", "Theta"]
list1.extend(list2)
list1 will now become: ["Alpha", "Beta", "Gamma", "Delta", "Eta", "Theta"]
The length of list1 will now become 6 in this scenario.
In Python, loop statements are used to perform repetitive tasks efficiently. But in some scenarios we need to break out of a loop or skip some iterations. For these scenarios Python provides loop control statements: break, continue, and pass, illustrated in the sketch below.
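A short sketch of the three statements:

for number in range(10):
    if number == 3:
        continue      # skip the rest of this iteration
    if number == 6:
        break         # leave the loop entirely
    if number == 5:
        pass          # placeholder: do nothing, keep going
    print(number)     # prints 0, 1, 2, 4, 5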
In Python, SciPy is an open-source library used for solving engineering, mathematical, technical, and scientific problems. We can easily manipulate data with the help of SciPy and perform data visualisation with a wide range of high-level commands. SciPy is pronounced “Sigh Pi”.
NumPy acts as the foundation of SciPy, as SciPy is built on top of it and is designed to work with NumPy arrays. Optimization and numerical integration are also possible using the numerical routines that SciPy provides. To set up SciPy on your system, the commands for different operating systems are given below.
Windows:
Syntax: Python3 -m pip install --user numpy scipy
Linux:
Syntax: sudo apt-get install python-scipy python-numpy
Mac:
Syntax: sudo port install py35-scipy py35-numpy
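After installation, a tiny example of the numerical integration SciPy provides, integrating x squared from 0 to 1.

from scipy import integrate

result, error = integrate.quad(lambda x: x ** 2, 0, 1)
print(result)    # approximately 0.3333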
BETWEEN operator: In SQL, the BETWEEN operator is used to test whether the provided expression lies within a defined range of values. The range is inclusive, and the values can be of any type, such as dates, numbers, or text. We can use the BETWEEN operator with SELECT, INSERT, DELETE, and UPDATE statements. The syntax to apply this operator is as below.
SELECT column_name(s) FROM table_name WHERE column_name BETWEEN value1 AND value2;
Output: It will return all the values from above column_name which lies between value1 and value2 including these 2 values also.
IN operator: In SQL, the IN operator is used to check whether an expression matches any of the values specified in a list of values. It can be used to eliminate the use of multiple OR conditions. We can also use the NOT IN operator, which functions exactly opposite to the IN operator, to exclude certain rows from the output. We can use the IN or NOT IN operator with SELECT, INSERT, DELETE, and UPDATE statements. The syntax to apply this operator is as below.
IN: SELECT column_name(s) FROM table_name WHERE column_name IN (list_of_values);
Output: It will return all the values from above column_name which matches with the specified “list_of_values”
NOT IN: SELECT column_name(s) FROM table_name WHERE column_name NOT IN (list_of_values);
Output: It will return all the values from above column_name excluding the specified “list_of_values”
In SQL, we can give temporary names to columns or tables; these are called aliases and apply to a specific query. When we don’t want to use the original name of the table or column, we use an alias to give it a temporary name. The scope of the alias is temporary and limited to that specific query.
We use alias to increase the readability of a column or a table name. This change is temporary and the original names that are stored in the database never get changed. Sometimes the names of table or column are complex so it is always preferred to use alias to give them an easy name temporarily. Below is the syntax to use alias for both table and column names.
Column Alias:
Syntax: SELECT column as alias_name FROM table_name;
Explanation: Here alias_name is the temporary name that is given to column name in the given table table_name.
Table Alias:
Syntax: SELECT column FROM table_name as alias_name;
Explanation: Here alias_name is the temporary name that is given to table table_name.
SQL injection is the process of inserting malicious SQL commands into the database that can exploit the user data stored in it. By inserting these statements, hackers effectively take control of the database and can destroy or manipulate the sensitive information stored in it. These SQL command insertions, or SQL injections, mainly happen through inputs on web pages, which makes this one of the most common web hacking techniques.
In web applications, web servers usually communicate with database servers in order to retrieve or store user data. Hackers input malicious SQL code which is executed when the web server connects to the database server, compromising the security of the web application.
We can make use of Restricted access privileges and user authentication to avoid any security breach which may impact the critical data present in database. Another way is to avoid using system administrator accounts.
This is a common yet one of the most important data engineer interview questions and answers for experienced professionals, don't miss this one.
In SQL, a trigger acts as a stored procedure that gets invoked when a triggering event occurs in a database. These triggering events can be caused by the insertion, deletion, or updating of any row or column in a particular table. For example, a trigger can be invoked when a new row is added to or deleted from a table, or when any row is updated. The syntax to create a trigger in SQL is as below.
Syntax:
create trigger [trigger_name] [before | after] {insert | update | delete} on [table_name] [for each row] [trigger_body]
Explanation:
1. Trigger will be created with a name as [trigger_name] whose execution is determined by [before | after].
2. {insert | update | delete} are examples of DML operations.
3. [table_name] is the table which is associated with trigger.
4. [for each row] determines the rows for which trigger will be executed.
5. [trigger_body] determines the operations that need to be performed after the trigger is invoked.
Description: Data Engineering is a very important term used in big data. It is the process of transforming the raw entity of data (data generated from various sources) into helpful information that can be used for various purposes. Data Engineering has become one of the most popular career choices today.
According to a study, the global big data and data engineering services market is expected to grow from USD 29.50 billion in 2017 to USD 77.37 billion by 2023, at a Compound Annual Growth Rate (CAGR) of 17.6% during the forecast period; 2017 is taken as the base year for the study, and the forecast period is 2018–2023. A data engineer takes on a lot of responsibilities daily, from collecting to analyzing data with the help of many tools.
If you are interested in data engineering and looking for top interview questions and answers in the field, the beginner and advanced level questions above are a good fit for you, covering the various skills of data engineering like Python, big data, Hadoop, SQL, databases, etc. Data analyst and data engineer jobs are growing at a fast rate, and the market has a lot of opportunities for both freshers and experienced engineers across the world. Good conceptual knowledge and a solid grasp of the underlying logic will help you crack interviews at many reputed companies. The above questions are designed to help you understand the concepts of data engineering deeply; we have tried to cover almost every topic of data engineering.
If you go through the questions above, you will easily find questions from beginner to advanced level according to your level of expertise. These questions will give you an extra edge over the other applicants applying for data engineering jobs. If you want to study data engineering topics in depth, you can enroll in big data courses on KnowledgeHut that can help you boost your basic and advanced skills.
Best of Luck.
This may seem like a pretty basic question, but regardless of your skill level, this is one of the most common questions that can come up during your interview. So, what is it? Briefly, Data Engineering is a term used in big data. It is the process of transforming the raw entity of data (data generated from various sources) into useful information that can be used for various purposes.
Data modelling is the scientific process of converting and transforming complex software data systems by breaking them up into simple diagrams that are easy to understand, thus making the system independent of any pre-requisites. You can explain any prior experience with Data Modelling, if any, in the form of some scenarios.
Companies can ask you questions about design schemas in order to test your knowledge regarding the fundamentals of data engineering. Data Modelling consists of mainly two types of schemas:
The difference between structured and unstructured data is as follows-
Parameter | Structured Data | Unstructured Data |
Storage | DBMS | File structures are unmanaged |
Standard | ODBC, ADO.net, and SQL | XML, STMP, CSV, and SMS |
Integration Tool | ELT (Extract, Transform, Load) | Batch processing or Manual data entry |
Scaling | Schema scaling is difficult | Schema Scaling is very easy. |
Version management | Versioning over tuples, row and tables | Versioned as a whole is possible |
Example | An ordered text dataset file | Images, video files, audio files, etc. |
In today’s world, the majority of big applications are generating big data that requires vast space and a large amount of processing power, Hadoop plays a significant role in providing such provision to the database world.
This is one of the most frequently asked data engineer interview questions for freshers in recent times. Here are the components of a Hadoop application.
A Hadoop Application is consist of -
NameNode is the master node in the Hadoop HDFS Architecture. It is used to store all the data of HDFS and also keep track of various files in all clusters. The NameNodes don’t store the actual data but only the metadata of HDFS. The actual data gets stored in the DataNodes.
Don't be surprised if this question pops up as one of the top Python interview questions for big data engineer in your next interview.
Hadoop streaming is one of the widely used utilities that comes with the Hadoop distribution. This utility is provided for allowing the user to create and run Map/Reduce jobs with the help of various programming languages like Ruby, Perl, Python, C++, etc. which can then be submitted to a specific cluster for usage.
Some of the important features of Hadoop are as below:
Expect to come across this, one of the most important data engineer interview questions for experienced professionals in data engineering, in your next interviews.
Blocks are considered as the smallest unit of data that is allocated to a file that is created automatically by the Hadoop System for storage of data in a different set of nodes in a distributed system. Large files are automatically sliced into small chunks called as blocks by Hadoop.
Block scanner as its name suggests, is used to verify whether the small chunks of files known as blocks that are created by Hadoop are successfully stored in DataNode or not. It helps to detect the corrupt blocks present in DataNode.
Following are the steps followed by the block scanner when it detects a corrupted DataNode block-
This whole process helps HDFS in maintaining the integrity of the data during read operation performed by a client.
A must-know for anyone looking for top data engineer interview questions, this is one of the frequently asked big data engineer interview questions.
Below are the steps to achieve security in Hadoop:
NameNode communicates and gets information from DataNode via messages or signals.
There are two types of messages/signals that are used for this communication across the channel:
The default ports for Task Tracker, Job Tracker, and NameNode in Hadoop are as below:
This question is asked by interviewers to check your understanding of the role of a data engineer.
The difference between NAS and DAS is as follows:
NAS | DAS |
NAS stand for Network Attached Storage | DAS stands for Direct Attached Storage |
Storage capacity of NAS is between 109 to 1012 in byte. | Storage capacity of DAS is 109 in byte. |
In NAS, Storage is distributed over distinct servers on a network | In DAS, storage is attached to the node where computation process is taking place. |
It has moderate storage management cost | It has high storage management cost |
Data transmission takes place using Ethernet or TCP/IP. | Data transmission takes place using IDE/ SCSI |
Below are various fields or languages used by data engineer:
In Hadoop, Rack awareness is the concept of choosing the DataNodes which are closer according to the rack information. By default, Hadoop assumes that all the nodes belong to the same rack.
In order to improve the network traffic while reading/writing HDFS files that are on the same or a nearby rack, NameNode uses the DataNode to read/ write requests. To achieve rack information, the rack ids of each DataNode are maintained by HDFS NameNode. This concept in HDFS is known as Rack Awareness.
When NameNode is down, it means that the entire cluster is down. So, the cluster won’t be accessible as it is down. All the services which are running on that cluster will also be down. So, in this scenario, if any user tries to submit a new job will get an error and job will get failed. All the existing jobs which are running will also get failed.
So briefly, we can say that when NameNode will get down, all the new, as well as existing jobs, will get failed as all services will be down. The user has to wait for the NameNode to restart and can run a job once the NameNode will get up.
Four Vs of big data describes four dimensions of big data. These are listed below:
The various XML configuration files present in Hadoop are as follows:
The main methods of reducer are given below:
One of the most frequently posed data engineer scenario based questions, be ready for this conceptual question.
FIFO also known as First In First Out is the simple job scheduling algorithm in Hadoop which implies that the tasks or processes that come first will be served first. In Hadoop, FIFO is the default scheduler. All the tasks or processes are placed in a queue and they get their turn to get executed according to their order of submission. There is one major disadvantage of this type of scheduling which is that the higher priority tasks have to wait for their turn which can impact the process.
Hadoop operations can be used in three different modes. These are listed below:
This, along with other data engineer questions for freshers, is a regular feature in data engineer interviews, be ready to tackle it with the approach mentioned below.
In Hadoop, replication factor depicts the number of times the framework replicates or duplicates the Data blocks in a system. The default replication factor in Hadoop is 3 which can be manipulated as per the system requirements. The main advantage of the replication process is to ensure data availability.
We can configure the replication factor in hdfs-site.xml file which can be less than or more than 3 according to the requirements.
In Hadoop, the primary phases of reducer are as follows:
The distance between two nodes is equal to the simple sum of the distance to the closest nodes. In order to calculate the distance between two nodes, we can use getDistance() method for the same.
In Hadoop, Context object is used along with the Mapper class so that it can interact with the other remaining parts of the system. Using the Context object, all the jobs and the system configuration details can be easily obtained in its constructor.
Information can be easily passed or sent to the methods like cleanup(), setup() and map() using the Context object. During map operations, vital information can be made available using the Context object.
In Apache Hadoop, Safe mode is a mode that is used for the purpose of maintenance. It acts as read-only mode for NameNode in order to avoid any modifications to the file systems. During Safe mode in HDFS, Data blocks can’t be replicated or deleted. Collection of data and statistics from all the DataNodes happen during this time.
The available components of Hive Data Model are as below:
Hive supports below-given complex data types:
In Hive, SerDe stands for Serialization and Deserialization. SerDe is a built-in Library present in Hadoop API. SerDe instructs Hive on how processing of a record(row) can be done.
Deserializer will take binary representation of a record and translate it into the java object that hive can be able to understand. Now, Serializer will take that java object on which Hive is already working and convert that into a format that can be processed by HDFS and can be stored.
The Table creation functions present in Hive are as follows:
The objects created by create statement in MySQL are listed below:
A staple in data engineer technical interview questions and answers, be prepared to answer this one using your hands-on experience.
In Hive, .hiverc acts as the initialization file. Whenever you open the CLI (Command Line Interface) in order to write the code for Hive, .hiverc is the first one file that gets loaded. All the parameters that have been initially set by you are contained in this file.
For example, you can set column headers that you want to be visible in the query results, the addition of any jar files, etc. This file is loaded from the hive conf directory.
Metastore acts as the central repository for Hive metadata. It is used for storing the metadata of Hive tables i.e., schemas and locations.
Metadata is first stored in metastore which is later stored in a relational database (RDBMS) whenever required.
Metastore consists of 3 types of modes for deployment. These are given below.
In Hive, multiple tables can be created for a single data file using the same HDFS directory. As we know already that metastore acts as the central repository for Hive metadata and it stores metadata like schemas and locations.
Data already remain in the same file. So, it becomes a very easy task to retrieve the different results for the corresponding same data based upon the schema.
In Hive, there are some special types of tables in which the values of columns appear in a repeating manner (Skew), these tables are called as skewed tables. In Hive, while creation of a particular table we can specify that table as SKEWED. All the skewed values in the table are written into separate files and the rest of the remaining values are stored in another file.
While writing queries, skewed tables help to provide better performance. Syntax to define a particular table as ‘skewed’ during its creation is as written below using an example.
CREATE TABLE TableName (column1 STRING, column2 STRING) SKEWED BY (column1) on (‘value’)
In MYSQL, we can see the data structure with the help of DESCRIBE command.
The syntax to use this command is as follows.
DESCRIBE Table name;
We can see the list of all tables in MYSQL using SHOW command.
The syntax to use the thing command is as follows.
SHOW TABLES;
We can perform various operations on strings as well as the substrings present in a table. In order to search for a specific string in a table column, we can use REGEX operator for the same.
Following are some of the ways how big data and data analytics can positively impact company’s business.
Below are the steps that need to be followed in order to deploy a big data solution.
FSCK stands for File System Consistency Check. Briefly, we can define FSCK as a command that is used in order to check any inconsistencies or any problems in HDFS file system or at the HDFS level.
Syntax of using FSCK command is as below.
hadoop fsck [ GENERIC OPTIONS] < path > [-delete | -move | -openforwrite ] [-files [ -blocks [ -locations | -racks] ] ]
Yarn is abbreviated ad Yet Another Resource Negotiator. In Hadoop, it is considered as one of the main components. While opening Hadoop, Yarn helps in processing and running data for stream processing, graph processing, batch processing, and interactive processing which are stored in HDFS. So briefly, we can say that YARN helps to run various types of distributed applications.
Using YARN, the efficiency of the system can be increased as data that is stored in HDFS is processed and run by various types of processing engines as depicted above.
It is also known for optimum utilization of all available resources that results in easy processing of a high volume of data.
A staple in senior data engineer interview questions and answers, be prepared to answer this one using your hands-on experience.
In Hadoop, HDFS abbreviated as Hadoop distributed file system is considered as the standard storage mechanism. It is built with the help of commodity hardware. As we all know that till now, Hadoop does not require a costly server with high processing power and bigger storage, we can use inexpensive systems with average processor and RAM. These systems are called as commodity hardware.
These are affordable, easy to obtain and compatibles with various operating systems like Linux, Windows and MS-DOS without any requirement of any special type of devices or equipment. Another benefit of using commodity hardware is its scalability.
The various functions of Secondary NameNode are as follows.
Combiner also known as Mini-Reducer acts as an optional step between Map and Reduce. Briefly we can that, it helps to take the output from Map function. It then summarizes that output using the same key and then it passes the final summarized records as input to the Reducer.
When we make use of MapReduce job on a large dataset. Then large chunk of data is generated by the Mapper which when passed to the reducer for further processing can cause congestion in the network. In order to deal with kind of congestion, Combiner is used by Hadoop Framework as an intermediate between Mapper and Reducer to reduce network congestion.
In Hadoop, when we are dealing with Big Data Systems, then the size of data is huge. Therefore, it is not a good practice to move this large amount of data across the network otherwise it may impact the system output and also causes network congestion.
In order to get rid of these above problems, Hadoop uses the concept of Data Locality. Briefly we can say that, it is the process of moving the computation towards the data rather than doing the opposite process of moving huge amount of data. In this way, data always remain local to storage locations. So, when a user runs a MapReduce job, then the code present in MapReduce is sent by NameNodes to DataNodes that contains the data related to MapReduce job.
Balancer is a utility provided by HDFS. As we know that, DataNodes stores the actual data related to any job or process. Datasets are divided into blocks and these blocks are stored across the DataNodes in Hadoop cluster. Some of these nodes are underutilized and some are overutilized by the storage of blocks, so a balance needs to be maintained.
Here comes the use of the balancer, which analyses block placement across the nodes and moves blocks from overutilized to underutilized DataNodes until the cluster is deemed to be balanced.
In Hadoop, the distributed cache is a utility provided by the MapReduce framework. Briefly, it can cache files such as jar files, archives and text files when they are needed by an application.
When a MapReduce job is running, this utility caches the read-only files and makes them available to all the DataNodes; each DataNode gets a local copy of the file, so tasks can access the cached files locally. These files remain on the DataNodes while the job is running and are deleted once the job is completed.
The default size of Distributed cache is 10 GB which can be adjusted according to the requirement using local.cache.size.
In Hive, there are various types of SerDe implementations available, and there is also a provision to create your own custom SerDe implementations. A few of the popular built-in implementations are LazySimpleSerDe, OpenCSVSerde, RegexSerDe, JsonSerDe and AvroSerDe.
In Python, we can pass a variable number of arguments to a function when we are unsure about how many arguments need to be passed. These arguments are passed using two special symbols: *args for a variable number of positional arguments and **kwargs for a variable number of keyword arguments.
Function flexibility can be achieved by passing these two types of special symbols, as the example below shows.
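A minimal example of both symbols (the function name and arguments are made up for illustration):

def describe(*args, **kwargs):
    # *args collects extra positional arguments as a tuple
    # **kwargs collects extra keyword arguments as a dictionary
    print("positional:", args)
    print("keyword:", kwargs)

describe(1, 2, 3, source="hdfs", format="parquet")
# positional: (1, 2, 3)
# keyword: {'source': 'hdfs', 'format': 'parquet'}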
The differences between Data warehouse and Database are given below.
Parameter | Data warehouse | Database |
Definition | It is a system that collects and stores information from multiple data sources within an organization | It is an organised collection of logical data that is easy to search, manipulate and analyse. |
Purpose and Usage | It is used for the purpose of analysis of your business | It is used for recording the data and performing various fundamental operations for your business. |
Data Availability | When required, data is refreshed and captured from various data sources | Real time data is always available |
Type of data stored | It contains only summarized data | It contains detailed data |
Usage of Queries | Complex queries are used | Simple queries are used |
Tables and Joins | In Data warehouse, Tables and Joins are simple | In Database, Tables and Joins are complex |
The differences between OLAP and OLTP are given below.
OLAP | OLTP |
It is used for managing informational data | It is used for managing operational data |
The size of database is 100 GB-TB | The size of database is 100 MB-GB |
It contains large volume of data | The volume of stored data is not that much large |
Access is mostly read-only; data is written infrequently through periodic batch loads | It has both read and write access modes |
It is partially normalized | It is completely normalized |
Its processing speed depends on many factors, such as query complexity and the volume of data involved | It has very high processing speed |
It is market oriented and is mainly used by analysts, managers and executives | It is customer oriented and is mainly used by clerks, clients and IT professionals |
The differences between the NoSQL and SQL database are as below.
Parameter | NoSQL database | SQL database |
History | Developed in the late 2000s with a focus on scalability and allowing rapid changes in applications | Developed in the 1970s with a focus on reducing the problem of data duplication |
Data Storage Model | Flexible models are used: documents, key-value pairs, wide-column tables with dynamic columns, or graphs | Tables with a fixed schema of rows and columns are used |
Schemas | Schemas are flexible | Schemas are rigid |
Scaling | Horizontal scaling is possible | Vertical scaling is possible |
Joins usage | Joins are not required in NoSQL | Joins are typically required in SQL |
Examples | MongoDB and CouchDB | MySQL, Oracle, Microsoft SQL Server, and PostgreSQL |
In modern applications that have complex and constantly changing data sets, NoSQL tends to be a better option than a traditional relational database, because such applications need a flexible data model that does not have to be fully defined up front.
NoSQL provides agile features that help companies get to market faster and ship updates faster. It is also well suited to storing real-time data.
When dealing with an increased data processing load, it is usually a better approach to scale out rather than scale up, and NoSQL is a good fit here as it is cost effective and can deal with huge volumes of data. Although relational databases generally provide better connectivity with analytical tools, NoSQL still offers many features that traditional databases do not.
In Python, both list and tuple are classes of data structures. Differences between list and tuple are as follows.
List | Tuple |
Lists are mutable i.e.; they can be modified | Tuples are immutable i.e.; they can’t be modified |
Memory consumption of List is more | Memory consumption of Tuple is less as compared to List |
List is more prone to errors and unexpected changes | Tuple is not prone to such errors and unexpected changes |
It contains a lot of built-in methods | It has comparatively few built-in methods |
Operations like insertion and deletion are performed better using list | Tuple is mainly used for accessing the elements |
List has dynamic characteristics so it is slower compared to tuple | Tuple has static characteristics so it is faster |
Syntax: list_data1 = ['list', 'can', 'be', 'modified', 'easily'] | Syntax: tuple_data1 = ('tuple', 'cannot', 'be', 'modified', 'ever')
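A short demonstration of the mutability difference described in the table above:

nums_list = [1, 2, 3]
nums_tuple = (1, 2, 3)

nums_list[0] = 10          # works: lists are mutable
print(nums_list)           # [10, 2, 3]

try:
    nums_tuple[0] = 10     # fails: tuples are immutable
except TypeError as err:
    print(err)             # 'tuple' object does not support item assignment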
While fetching records from a table, we may come across multiple duplicate entries; it makes no sense to fetch all of them, so to avoid redundancy we want to fetch only the unique entries.
For achieving this, DISTINCT keyword is provided by the SQL which we can use with the SELECT statement so that we can eliminate the duplicate data entries and can only fetch unique data entries.
The syntax to use this keyword to eliminate duplicate data is as below:
SELECT DISTINCT column1, column2, column3...columnM FROM table_name1 WHERE [conditions]
We can also use the UNIQUE keyword to handle duplicate data. The UNIQUE constraint is used in SQL to ensure that all values present in a specific column are different.
COSHH stands for Classification and Optimization-based Scheduling for Heterogeneous Hadoop systems.
In a Hadoop system, a large number of tasks are multiplexed and executed in a common data centre, so the Hadoop cluster ends up being shared among many users, which increases system heterogeneity. The default Hadoop schedulers do not give this issue much importance. COSHH was designed and implemented to rectify that by providing scheduling at both the cluster and the application level, which improves job completion time.
This question mainly focuses on knowing how you can actually deal with unexpected problems in high pressure situations.
Unexpected problems are inevitable, and many situations arise in which you encounter them while doing your routine jobs or tasks. The same is true for data maintenance.
Data maintenance can be considered a daily task that needs to be monitored properly to make sure all the built-in tasks and corresponding scripts execute as expected. For example, to prevent corrupt indexes from being added to the database, we can create maintenance tasks that detect and block such corrupt indexes before they cause any serious damage.
Advantages and disadvantages of cloud computing are as follows.
Advantages: on-demand scalability, lower upfront infrastructure cost, accessibility from anywhere, and managed backup and recovery.
Disadvantages: dependence on a stable internet connection, potential vendor lock-in, less direct control over the infrastructure, and data security and compliance concerns.
This question mainly focuses on the problems you have faced while working as a data engineer in your prior experience. You can mention some of the most common problems as an answer.
In the modern world, data has become the new currency. The roles of data engineer and data scientist both revolve around data, but there are differences in their duties, as mentioned below.
Data Engineer | Data Scientist |
This role mainly focuses on the collection and preparation of data. It covers designing and implementing the pipelines that manipulate and transform unstructured data into the required format that data scientists can use for analysis. | This role mainly focuses on using that prepared data to extract patterns from it with analytical tools, mathematics and statistical knowledge, and on providing the resulting insights that can positively impact the business. |
Considering the importance of data, it is the duty of the data engineer to keep the data safe and secure and to take data backups to avoid any loss. | After performing the analysis, it is the job of the data scientist to convey the results to the stakeholders, so good communication skills are a must for them. |
Big Data and Database management skills are must for a data engineer | Machine learning is a must skill required for a data scientist. |
NFS is Network File System and HDFS is Hadoop Distributed File System. The various differences between the two are as follows.
NFS | HDFS |
Only a small amount of data can be stored and processed with NFS | A large amount of data, or big data, can be stored and processed with HDFS. |
NFS stores data on a single dedicated machine or a disk of a dedicated network machine; these data files can be accessed by clients over the network | HDFS stores data in a distributed manner; in other words, data is stored across many dedicated machines or network computers |
NFS is not fault tolerant, and data can be lost if a failure occurs; this data cannot be recovered later. | HDFS is fault tolerant, and data can easily be recovered if a node fails. |
There is no data redundancy in NFS, as all the data is stored on a single dedicated machine. | HDFS deliberately keeps redundant copies, because the same data blocks are replicated across multiple machines to provide fault tolerance. |
Feature selection is the process of identifying and selecting the most relevant features that can be input to the machine learning algorithms for the purpose of model creation.
Feature selection techniques are used to discard redundant or unrelated features from the input to machine learning models by decreasing the number of input variables and narrowing the inputs down to only the relevant features. The main advantages of using these techniques are reduced overfitting, improved model accuracy, and shorter training time.
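As a hedged sketch (assuming scikit-learn is installed; the synthetic data below is made up), univariate selection with SelectKBest keeps only the k best-scoring features:

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data: 8 features, only 3 of which are informative
X, y = make_classification(n_samples=200, n_features=8, n_informative=3, random_state=0)

selector = SelectKBest(score_func=f_classif, k=3)    # keep the 3 best-scoring features
X_reduced = selector.fit_transform(X, y)

print(X.shape, "->", X_reduced.shape)                # (200, 8) -> (200, 3)
print("selected feature indices:", selector.get_support(indices=True))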
There are a few ways in which we can handle missing values in big data: dropping the rows or columns that contain too many missing values, imputing numeric values with the mean or median, and imputing categorical values with the mode (the most frequent value).
Other than the above-mentioned techniques, we can also use the K-NN algorithm, the Random Forest algorithm, the Naive Bayes algorithm, and the Last Observation Carried Forward (LOCF) method to handle missing values in big data.
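A minimal pandas sketch of the common approaches (the small DataFrame and its column names are hypothetical):

import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 40, 35], "city": ["Pune", "Delhi", None, "Delhi"]})

print(df.dropna())                                  # 1. drop rows that contain missing values
print(df["age"].fillna(df["age"].median()))         # 2. impute a numeric column with its median
print(df["city"].fillna(df["city"].mode()[0]))      # 3. impute a categorical column with its mode
print(df["age"].ffill())                            # 4. LOCF: carry the last observation forward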
Outliers are the data records which are different from the normal records in some of their characteristics. It is very important to first decide the characteristics of the normal records in order to detect the outliers. These records when used in algorithms or analytical systems can provide abnormal results which may impact the analysis process. So, it is very important to detect the outliers to avoid such abnormalities.
We can detect outliers in small tables or graphs simply by looking at them. For example, suppose a table contains the Name and Age of a few people and one row has an Age of 500. We can easily tell that this is an invalid value, because an age can be 40, 50 or 55 but not 500; we can flag the record even though we cannot know the correct value. This kind of manual detection works when the table has a limited number of records, but when it contains thousands of records it becomes impractical to spot outliers by eye.
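For larger tables, a simple statistical rule such as the interquartile range (IQR) can flag outliers automatically. A minimal sketch with pandas (the ages below are invented):

import pandas as pd

ages = pd.Series([40, 50, 55, 42, 48, 500])           # 500 is clearly suspicious

q1, q3 = ages.quantile(0.25), ages.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = ages[(ages < lower) | (ages > upper)]
print(outliers)                                        # flags only the value 500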
The differences between the K-Nearest Neighbours and K-Means methods are as below.
KNN | K-means |
KNN is supervised learning algorithm which can be used for classification or regression purposes. The Classification of nearest K points is done by KNN so that category of all points can be easily determined | k-means clustering or method is an unsupervised learning algorithm which can be used for clustering purpose. Here you select K number of clusters and then place each of the data points into those K clusters |
Being distance-based, KNN performs better when all the features are on the same scale | K-means is also distance-based, so it likewise benefits from feature scaling |
Logistic regression is a predictive model used to analyse large datasets and determine a binary output for a given set of input variables. The binary output can take only a limited number of values, such as 0/1, true/false or yes/no.
Logistic regression makes use of a sigmoid function to map a weighted combination of the inputs to a probability between 0 and 1. An acceptance threshold is then set to determine whether a particular instance belongs to the positive class: if the predicted probability is above the threshold, the instance is assigned to that class; otherwise it is not. There are three types of logistic regression: binary, multinomial and ordinal logistic regression.
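A small numeric sketch of the sigmoid and the acceptance threshold (the weights, bias and threshold here are illustrative assumptions, not a trained model):

import numpy as np

def sigmoid(z):
    # Maps any real-valued score to a probability between 0 and 1
    return 1.0 / (1.0 + np.exp(-z))

weights, bias = np.array([0.8, -0.4]), 0.1     # assumed model parameters
x = np.array([2.0, 1.5])                       # one input instance

probability = sigmoid(np.dot(weights, x) + bias)
threshold = 0.5
print(probability, "->", "class 1" if probability >= threshold else "class 0")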
A/B testing, also known as split testing, is a randomised statistical experiment performed on two different variants (A and B) of a webpage or application: each variant is shown to a set of end users, and we analyse which of the two creates the larger impact or proves more effective and beneficial to them. A/B testing has several benefits, such as enabling data-driven decisions and reducing the risk of rolling out changes that hurt the user experience.
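As a hedged sketch of how the two variants might be compared statistically (assuming SciPy is available; the conversion counts below are invented), a two-proportion z-test checks whether B's conversion rate really differs from A's:

from math import sqrt
from scipy.stats import norm

# Invented data: conversions out of visitors for variants A and B
conv_a, n_a = 120, 2400
conv_b, n_b = 160, 2380

p_a, p_b = conv_a / n_a, conv_b / n_b
p_pool = (conv_a + conv_b) / (n_a + n_b)
se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))

z = (p_b - p_a) / se
p_value = 2 * norm.sf(abs(z))            # two-sided p-value
print(round(z, 2), round(p_value, 4))    # a small p-value suggests B truly differs from A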
Collaborative filtering is a technique that uses various algorithms to provide personalized recommendations to users. It is also known as social filtering. Some of the popular websites that use this kind of filtering are iTunes, Amazon, Flipkart and Netflix.
In collaborative filtering, a user is given personal recommendations based on the common interests or preferences of other, similar users, with the help of prediction algorithms. Take two users A and B: if user A visits Amazon and buys items 1 and 2, and user B later buys item 1, then item 2 can be recommended to user B based on that overlap.
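A toy user-based sketch with NumPy (the ratings matrix is invented; this is an illustration of the idea, not a production recommender): users with similar rating vectors get recommendations from each other's purchases.

import numpy as np

# Rows = users A, B, C; columns = items 1..4; 0 means "not rated/bought"
ratings = np.array([
    [5, 4, 0, 1],   # user A bought items 1, 2 and 4
    [5, 0, 0, 1],   # user B bought items 1 and 4
    [1, 0, 5, 4],   # user C has very different taste
], dtype=float)

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

target = ratings[1]                               # recommend for user B
sims = [cosine(target, ratings[i]) for i in (0, 2)]
most_similar = ratings[0] if sims[0] > sims[1] else ratings[2]

# Recommend items the similar user rated highly but user B has not bought yet
recommend = np.where((most_similar >= 4) & (target == 0))[0]
print("recommend item indices:", recommend)       # item 2 (index 1) for user B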
“is” operator is used for the purpose of reference equality to check whether the two references or variables are pointing to the same object or not. Accordingly, it returns value as true or false.
“==” operator is used for the purpose of value equality to check whether the two variables are having same value or not. Accordingly, it returns value as true or false.
We can illustrate this with two lists X and Y and a third variable Z that references Y.
X = [1,2,3,4,5]
Y = [1,2,3,4,5]
Z = Y
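With these definitions, the comparisons behave as follows:

print(X == Y)   # True  - the two lists hold the same values
print(X is Y)   # False - X and Y are two separate objects in memory
print(Z is Y)   # True  - Z and Y refer to the very same object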
Python memory manager does the task of managing memory in Python. All the data structures and objects in Python are stored in private heap. It is the duty of Python memory manager only to manage this private heap. Developers can’t access this private heap space. This private heap space can be allocated to objects by memory manager.
Python memory manager contains object specific allocators to allocate the space to specific objects. Along with that it also has raw memory allocators to make sure that space is allocated to the private heap.
Python provides a garbage collector so that developers don't need to do garbage collection manually. The main job of this collector is to reclaim unused space in the private heap and make it available for new objects.
Decorators are one of the most important and powerful tools present in Python. They let us modify the behaviour of a function or a class without changing its source code.
Decorator helps to wrap the function or a class with another function to modify the behaviour of wrapped function or a class without making any permanent changes to that specific function source code.
In Python, functions are first-class objects, so they can easily be passed as arguments. In a decorator, the function to be decorated is passed as an argument to another function and is then called inside the wrapper function, as in the sketch below.
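A minimal sketch of a decorator that logs calls to the function it wraps (the names log_calls and add are made up for illustration):

import functools

def log_calls(func):
    @functools.wraps(func)                       # keep the wrapped function's name and docstring
    def wrapper(*args, **kwargs):
        print(f"calling {func.__name__} with {args} {kwargs}")
        return func(*args, **kwargs)
    return wrapper

@log_calls
def add(a, b):
    return a + b

print(add(2, 3))
# calling add with (2, 3) {}
# 5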
append(): In Python, when we pass an argument to append(), it is added to the list as a single entity. In other words, when we append one list to another, the whole list is added as a single object at the end, so the length of the target list increases by 1 only. append() has an amortized time complexity of O(1).
Example: Let’s take an example of two lists as shown below.
list1 = ["Alpha", "Beta", "Gamma"]
list2 = ["Delta", "Eta", "Theta"]
list1.append(list2)
# list1 is now: ["Alpha", "Beta", "Gamma", ["Delta", "Eta", "Theta"]]
The length of list1 will now become 4 after addition of second list as a single entity.
extend(): In Python, when we pass an argument to extend(), all the elements contained in that argument are added to the list; in other words, the argument is iterated over. So the length of the list increases by the number of elements added from the other list. extend() has a time complexity of O(n), where n is the number of elements in the argument passed to extend().
Example: Let’s take an example of two lists as shown below.
list1 = ["Alpha", "Beta", "Gamma"]
list2 = ["Delta", "Eta", "Theta"]
list1.extend(list2)
# list1 is now: ["Alpha", "Beta", "Gamma", "Delta", "Eta", "Theta"]
The length of list1 will now become 6 in this scenario.
In Python, loop statements are used to perform repetitive tasks efficiently. In some scenarios, however, we need to exit a loop early or skip certain conditions. For these cases Python provides the loop control statements break (exit the loop), continue (skip to the next iteration) and pass (a do-nothing placeholder), as shown in the example below.
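A short example of all three statements:

for n in range(6):
    if n == 2:
        continue        # skip the rest of this iteration and move to the next value
    if n == 4:
        break           # exit the loop entirely
    if n == 0:
        pass            # placeholder: do nothing, keep the syntax valid
    print(n)
# prints 0, 1, 3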
In Python, SciPy is an open-source library used for solving various engineering, mathematical, technical and scientific problems. We can easily manipulate data with SciPy and perform data visualisation with a wide number of high-level commands. SciPy is pronounced "Sigh Pie".
NumPy acts as the foundation of SciPy, as SciPy is built on top of it, and the SciPy libraries are designed to work with NumPy arrays. Optimization and numerical integration are also possible using the numerical routines SciPy provides. To set up SciPy on your system, the commands for different operating systems are given below.
Windows:
Syntax: python -m pip install --user numpy scipy
Linux:
Syntax: sudo apt-get install python-scipy python-numpy
Mac:
Syntax: sudo port install py35-scipy py35-numpy
BETWEEN operator: In SQL, the BETWEEN operator is used to test whether the provided expression lies within a defined range of values. While testing, the range is inclusive. The values can be of any comparable type, such as dates, numbers or text. We can use the BETWEEN operator with SELECT, INSERT, DELETE and UPDATE statements. The syntax is as below.
SELECT column_name(s) FROM table_name WHERE column_name BETWEEN value1 AND value2;
Output: It will return all the values from above column_name which lies between value1 and value2 including these 2 values also.
IN operator: In SQL, the IN operator is used to check whether an expression matches any of the values specified in a list, and it can be used to eliminate multiple OR conditions. We can also use the NOT IN operator, which works exactly the opposite way, to exclude certain rows from the output. Both can be used with SELECT, INSERT, DELETE and UPDATE statements. The syntax is as below.
IN: SELECT column_name(s) FROM table_name WHERE column_name IN (list_of_values);
Output: It will return all the values from above column_name which matches with the specified “list_of_values”
NOT IN: SELECT column_name(s) FROM table_name WHERE column_name NOT IN (list_of_values);
Output: It will return all the values from above column_name excluding the specified “list_of_values”
In SQL, we can give temporary names to columns or tables; these are called aliases, and they apply to a specific query. When we don't want to use the original name of a table or column, we use an alias to provide a temporary name. The scope of the alias is limited to that query.
We use alias to increase the readability of a column or a table name. This change is temporary and the original names that are stored in the database never get changed. Sometimes the names of table or column are complex so it is always preferred to use alias to give them an easy name temporarily. Below is the syntax to use alias for both table and column names.
Column Alias:
Syntax: SELECT column as alias_name FROM table_name;
Explanation: Here alias_name is the temporary name that is given to column name in the given table table_name.
Table Alias:
Syntax: SELECT column FROM table_name as alias_name;
Explanation: Here alias_name is the temporary name that is given to table table_name.
SQL injection is the process of inserting malicious SQL commands to the database that can exploit the user data stored in it. By inserting these statements, hackers actually take control of the database and can destroy and manipulate sensitive information stored in database. These SQL command insertions or SQL injection mainly happens using inputs through web pages which is one of the most common web hacking techniques.
In Web applications, usually web servers do communication with the database servers in order to retrieve or store data related to user in the database. Hackers input these malicious SQL codes which are executed once the web server tries to make connection with the database server resulting in compromising the security of the web application.
To prevent SQL injection, the most important measure is to use parameterized queries (prepared statements) and to validate or sanitise user input, so that input is never concatenated directly into SQL strings. In addition, restricted access privileges and proper user authentication limit the damage a breach can cause, and application connections should avoid using system administrator accounts.
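A minimal sketch of the parameterized-query defence (assuming Python's built-in sqlite3 module; the users table and the attacker's input are hypothetical):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (username TEXT, password TEXT)")
conn.execute("INSERT INTO users VALUES ('admin', 'secret')")

user_input = "admin' OR '1'='1"     # a classic injection attempt

# Unsafe: string concatenation would let the attacker rewrite the query
# query = "SELECT * FROM users WHERE username = '" + user_input + "'"

# Safe: the ? placeholder treats the input purely as data, never as SQL
rows = conn.execute("SELECT * FROM users WHERE username = ?", (user_input,)).fetchall()
print(rows)    # [] - the injection string matches no real user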
This is a common yet one of the most important data engineer interview questions and answers for experienced professionals, don't miss this one.
In SQL, a trigger acts as a stored procedure that is invoked automatically when a triggering event occurs in the database. These events can be caused by the insertion, deletion or updating of rows in a particular table; for example, a trigger can fire when a new row is added to or deleted from a table, or when a row is updated. The syntax to create a trigger in SQL is as below.
Syntax:
create trigger [trigger_name] [before | after] {insert | update | delete} on [table_name] [for each row] [trigger_body]
Explanation:
1. Trigger will be created with a name as [trigger_name] whose execution is determined by [before | after].
2. {insert | update | delete} are examples of DML operations.
3. [table_name] is the table which is associated with trigger.
4. [for each row] determines the rows for which trigger will be executed.
5. [trigger_body] determines the operations that needs to be performed after trigger is invoked.
Data Engineering is a very important term used in big data. It is the process of transforming raw data (data generated from various sources) into helpful information that can be used for various purposes. Data Engineering has become one of the most popular career choices today, and you can secure a career with instructor-led data engineer bootcamp training.
According to one study, the global big data and data engineering services market is expected to grow from USD 29.50 billion in 2017 to USD 77.37 billion by 2023, at a Compound Annual Growth Rate (CAGR) of 17.6% over the forecast period (2018–2023, with 2017 as the base year). A data engineer takes up a lot of responsibilities daily, from collecting data to analysing it with the help of many tools.
If you are interested in data engineering and are looking for top interview questions and answers in the field, the beginner and advanced questions above are designed for you; they cover the key data engineering skills such as Big Data, Hadoop, Python, SQL and databases. Data analyst and data engineer jobs are growing quickly, and the market has plenty of opportunities for both freshers and experienced engineers across the world. Good conceptual knowledge and a strong hold on the underlying logic will help you crack interviews at many reputed companies. The questions above are designed to help you understand the concepts of data engineering deeply, and we have tried to cover almost every topic.
If you go through the material above, you will easily find questions from beginner to advanced level according to your level of expertise. These questions will give you an extra edge over other applicants for data engineering jobs. If you want to study data engineering topics in depth, you can enroll in big data courses on KnowledgeHut to boost your basic and advanced skills.
Best of Luck.