The DataStage tool is the product used for ETL (Extract, Transform and Load) from various source servers, data files, applications, etc., to the destination systems, where the data is stored and used to implement Business Intelligence. The tool was first developed by VMark in the mid-90s and acquired by IBM in 2005, after which it was renamed IBM InfoSphere DataStage. In this article, we cover DataStage interview questions with descriptive answers for all levels, from beginner to intermediate and experienced. The topics covered include the ETL process, DataStage architecture, capabilities, components, operators, and DataStage job configuration and scheduling. With these DataStage interview questions and answers, you can be confident and well-versed with the concepts for the upcoming interview. Here you can access the best resources, which will lead you to crack the interview easily.
DataStage is a tool provided by IBM that is used to design, develop, and run applications that populate large volumes of data into data warehouses and datamarts by extracting data from diverse databases on Windows servers. It includes graphical visualizations for data integration and can also extract data from multiple sources and run scheduled jobs. As such, it is considered one of the most powerful ETL tools. DataStage has different versions that companies can use depending on their needs: Server Edition, MVS Edition, and Enterprise Edition. The IBM DataStage tool was the first ETL tool to introduce the concept of parallelism.
Some of the notable capabilities of the DataStage tool are -
InfoSphere DataStage and QualityStage can access data in enterprise applications and data sources such as:
The aspects of IBM InfoSphere DataStage are -
ETL tool is required by the Data Engineers for configuring the data from one source to another. DataStage is a tool in the IBM InfoSphere Information Server Suite where we can create data pipelines from a source system to the destination system by designing the internal data transformations such as merging, joining, lookups, and clustering data to create meaningful data.
The data is required to populate the different data tables in a data warehouse, datamarts, and staging tables, from which other systems can extract the data for data analysis, data science, etc.
The system allows the creation of scheduled jobs and features parallel processing, which gives it a performance edge over other such ETL tools.
The DataStage tool, being an ETL (Extract, Transform, Load) tool, extracts data from source systems (such as mainframe databases, OLAP databases, file systems, database tables, or an SQL query in Oracle). There is at least one such source system from which we get the data, and after the transformation, the prepared data is populated into one or more destination systems (database tables, file output, analytics tools, etc.), which are configured as destinations in the tool.
The DataStage tool also supports big data on Hadoop through components such as the HBase connector, JSON support, file system stages, etc.
A DataStage job is a code flow prepared by the developer, arranged as a process of stages that defines the final data and the requirement being asked for.
In each stage, the task is broken into specific functionalities; these stages are clubbed together in the tool and scheduled, and this is what is called a DataStage job.
For example –
Suppose we have an objective to sum the sales amount. So, breaking it into stages, firstly, we would be performing the operation of the ‘group by’ function on the Sales data.
In the next step, we load that data into the destination table, which is designed as a separate stage. Then, using DataStage links, we make the flow move from the ‘group by’ stage to the target loading stage.
The dsjob command is used to get and display information about a specific project, job, stage, or link, and to access log files. You can use the dsjob command to add entries to a job's log file or to retrieve and display specific log entries, as illustrated in the short example after the table below.
Command | Functionality |
---|---|
$dsjob -run | Used to run the DataStage job |
$dsjob -stop | Used to stop the job that is currently running |
$dsjob -jobid | Used for providing the job information |
$dsjob -report | Used for displaying the complete job report |
$dsjob -lprojects | Used for listing all the projects that are present |
$dsjob -ljobs | Used for listing all the jobs that are present in the project |
$dsjob -lstages | Used for listing all the stages of the current job |
$dsjob -llinks | Used for listing all the links |
$dsjob -lparams | Used for listing all the parameters of the job |
$dsjob -projectinfo | Used for retrieving information about the project |
$dsjob -log | Used for adding a text message to the log |
$dsjob -jobinfo | Used for retrieving information about the job |
$dsjob -stageinfo | Used for retrieving information about a stage of the job |
$dsjob -loginfo | Used for getting information about the log |
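As a quick illustration, a few of these commands can be strung together as below; the project name dstage_proj and the job name load_sales are hypothetical placeholders, and depending on how the engine is configured you may also need to pass connection options such as -domain, -user, -password, and -server.
dsjob -lprojects
dsjob -ljobs dstage_proj
dsjob -jobinfo dstage_proj load_sales
dsjob -logsum dstage_proj load_sales
The first two commands list the projects and the jobs in a project, while the last two report the current status of a job and a summary of its log entries.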
A must-know for anyone looking for DataStage advanced interview questions, this is one of the frequent questions asked of managers as well.
IBM DataStage Flow Designer is a client application where we can create, edit, schedule, and trigger jobs. It provides a dashboard palette from which we can drag and drop components to create jobs. A few important features of the DataStage Flow Designer are listed below -
We can trigger the code through designer tasks and link the different stages using DataStage links. The tool provides options such as using containers; creating, compiling, and running jobs; editing and scheduling existing jobs; importing/exporting jobs; loading columns; maintaining job metadata; and updating configuration files and parameters, which makes job execution dynamic.
A tier is a logical grouping of components within the tool and the computers on which those components are installed.
Every tier contains a subset of the components that make up InfoSphere Information Server.
Tiers provide services, job execution, and storage of metadata and other data for the product modules.
The main tiers are -
Client Tier
It consists of client programs and consoles used for development, administration, and other tasks and the computers on which those programs and consoles are installed.
Services Tier
The services tier consists of the application server, common services, and product services for the suite and product modules. It provides common services (such as security) and services that are specific to a particular product module. On the services tier, IBM WebSphere Application Server hosts the services, as well as the web-based InfoSphere Information Server applications.
Engine Tier
The engine tier is a logical grouping of the engine components and communication agents, which execute jobs and other tasks for the product modules.
Repository Tier
The repository tier consists of the metadata repository and, if installed, other data stores to support other product modules. The metadata repository contains shared metadata, data, and configuration information for InfoSphere Information Server product modules. Other data stores contain extension data for use by supporting product modules. The operational database is the data store used by the engine operations console.
Microservices Tier
The microservices tier consists of many different microservices, each with a different role.
Tier Relationships
The tiers work together to provide the services, job execution, and storage of metadata and other data for the product modules that we install.
The common services in DataStage are mainly –
For complex architectures, jobs can be created to run multiple other jobs in combination, and these are called DataStage sequences. Sequence jobs allow programming controls such as branching and looping to be incorporated into the flow.
In DataStage, this special type of job, called a sequence job, specifies a series of parallel or server jobs to run.
Different actions can be taken depending on whether the jobs in the sequence succeed or fail. After creating a sequence job, we schedule it to run using the InfoSphere DataStage Director client, just like any parallel or server job.
The Merge stage joins tables based on the values of key columns as specified in the table definition.
The Merge stage can have any number of input links and only a single output link, and there must be the same number of reject links as there are update input links.
The Merge stage combines a master data set with the update data sets provided to it. It is one of three processing stages that join tables based on the values of key columns; the other two are the Join stage and the Lookup stage.
The data files contain the raw or exported data that is to be imported into the system, i.e., the information required by the user, whereas the descriptor file contains the metadata of the data files, i.e., the information and description about the data in the files provided.
Often, we only need the descriptor file, where the information is available in brief, as dealing with data at large scale does not give us the flexibility to go through every record delivered by the source system.
Expect to come across this important DataStage question in your next interviews.
DataStage has three processing stages that allow you to join tables based on key column values: Lookup, Join, and Merge. This section explains which stage to choose when, the differences between these stages, and development guidelines for using them.
We use the Lookup stage when:
We use the Join stage when:
We use the Merge stage when:
A data pipeline is the process of pulling records from a source system and moving them through a set of processing functions (transformations) defined in a data flow (job). Records can be processed as they pass through the pipeline without having to be written to intermediate tables.
The requirement of data Pipelining is to -
Data partitioning is an approach to parallelism that divides a data set into partitions or subsets of the data set. In the absence of resource constraints or other data skew issues, data partitioning linearly improves application performance.
DataStage provides a capability where it automatically partitions data based on the partition type required by the phase.
Traditional ETL tools do not have this capability, so developers had to manually create data partitions and rewrite the application or the partitioning scripts whenever administrators wanted to use more hardware capacity, which was costly and time-consuming.
The sort function can be used for removing duplicates in DataStage. When configuring the sort, the user should set the option that allows duplicates to false.
Steps to remove the duplicates are - add a Sort stage (or a Remove Duplicates stage) to the job, specify the key columns on which duplicate rows are identified, and set the Allow Duplicates option to false.
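As a quick sanity check outside DataStage, the same idea can be tried on a delimited extract with standard Unix commands (the file name below is only an illustrative placeholder):
$> sort -u input_extract.txt > deduped.txt
$> sort input_extract.txt | uniq -d
The first command writes the de-duplicated rows to a new file, and the second lists only the rows that occur more than once.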
To access DataStage, we need to download and install the latest version of the IBM InfoSphere Server present on the web. The server supports AIX, Linux, and Windows operating systems and could be chosen as per the requirement.
We could migrate jobs and other assets from an older version of InfoSphere to a newer version by using the asset interchange tool.
Required Installation Files
For installing and configuring Infosphere DataStage, we need the following files in the local system.
For a Windows server:
EtlDeploymentPackage-windows-oracle.pkg
EtlDeploymentPackage-windows-db2.pkg
For a Linux server:
EtlDeploymentPackage-linux-db2.pkg
EtlDeploymentPackage-linux-oracle.pkg
IBM DataStage has 4 components – Administrator for managing Admin tasks, Manager for storing reusable Metadata, Designer for managing applications/ jobs, and Director for validating, scheduling, executing and monitoring DataStage server jobs and parallel jobs.
DataStage follows a client-server model as its architecture, and the main components of the client-server architecture are:
For setting up the projects, we start by creating the DataStage job on the DataStage server, and then all objects that we create in the meantime are stored in the DataStage project. The DataStage project serves as a separate environment on the server for the storage and maintenance of jobs, tables, definitions, and routines.
A DataStage project represents a separate environment on the installed server containing the jobs, tables, definitions, and routines created for the task.
The “Palette” window contains all the stages which are available in the DataStage tool. Based on the functionalities, multiple stages are provided from which the developer chooses.
The various categories of stages present in the Palette are General, Data Quality, Database, Development, File, Processing, etc.
Different DataStage Processing Stages in the tool are the functionalities that help in the following:
A few examples of processing Stages are –
Transformer, Aggregator, FTP, Link Collector, Interprocess, Pivot, Sort, etc., which help in handling data transformation and processing effectively.
A staple in DataStage interview questions and answers for experienced, be prepared to answer this one using your hands-on experience.
DataStage offers two sorting methods for parallel jobs: sorting on a link and the standalone Sort stage.
By default, both methods use the same sort package (the DataStage sort operator). The standalone Sort stage offers more options.
Link Sort
Link sort is specified in the Input/Partitioning options of a stage; the sort option becomes available when a keyed partitioning method is specified and cannot be used when the Auto partitioning method is selected.
Key column usage:
Sort Stage
The standalone Sort stage has additional options that are not available with link sorting.
The following properties are not available when sorting on a link:
We could use command-line functions to import/export DataStage jobs.
Additionally, the XML format serves as the basis for moving jobs/data flows between the Design Studio and DataStage environments. We could export a DataStage job to an XML file and then import it into the Design Studio to migrate the job from one server to another. The imported job is stored as a subflow in the designer.
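A hedged command-line sketch using the istool client, which exports DataStage assets to an .isx archive, is shown below; the host, credentials, project, and job names are hypothetical, and the exact option names and asset-path syntax vary by Information Server version, so confirm them with istool export -help before relying on this.
istool export -domain services_host:9443 -username dsadm -password secret -archive /tmp/MyJob.isx -datastage "engine_host/MyProject/Jobs/MyJob.pjb"
A matching istool import against the target domain then brings the archive into the destination project.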
A routine is a set of functions that are defined in the DataStage Manager and invoked through a Transformer stage.
There are three types of routines:
Routines are stored in the Routines branch of the DataStage repository, where they can be created, viewed, or edited, which could be done through the Routine Dialog Box.
The system variables are read-only variables that can be used in a Transformer stage or a routine. They supply system information, and their names start with ‘@.’
A few system variables and their uses –
System Variable | Use |
---|---|
@FALSE | Replaces the value with 0. |
@TRUE | Replaces the value with 1. |
@INROWNUM | Input row number. |
@OUTROWNUM | Output row number per link. |
@NUMPARTITIONS | Total number of partitions for the stage. |
@PARTITIONNUM | Partition number for the particular instance. |
@ITERATION | Iteration number of the current loop. |
A container is a group of stages and links. Containers allow you to simplify and modularize your job design by replacing complex areas of your diagram with single container stages.
These containers are available for both - parallel and server jobs.
IBM DataStage tool offers two types of containers.
Local containers. They are created within a job and can only be accessed within that job. Local containers are edited on tabs in the job's diagram window. Local containers can be used in server jobs or parallel jobs. Their primary use is to "clean up" the job design.
Shared containers. They are created separately and stored in the repository, like jobs. They can be inserted into any job design.
The DataStage director tool is used for:
The DataStage Director window is split into two panes.
The left pane, the Job Categories pane, shows the job categories present in the repository.
The right pane shows one of three views: the Status view, the Schedule view, or the Log view.
DataStage Director, therefore, has three view options.
InfoSphere DataStage, Java, and Hive support different data types, so the connector needs to map data types depending on the scenario in which the connector is used. You can configure the Hive Connector stage to automatically add columns to the output link at runtime.
The Hive Connector supports partitioned reads when reading data. There are two ways to do this.
The DataStage tool provides support to connect to the HBase dataset and access their database components and tables in the DataStage. It is required to perform the following tasks:
Don't be surprised if this question pops up as one of the top DataStage technical interview questions in your next interview.
There are two basic types of parallelism: pipelining and partitioning. The DataStage tool provides both methods. For example, consider a simple parallel job that extracts data from one data source, transforms it in some way, and then writes it to another data source. The job looks the same on the Designer canvas in either case, but it can be configured to work in different ways depending upon the parallelism selected.
Pipeline parallelism
If you run this sample job on a system with three or more processors, the read stage will start on one processor and begin filling the pipeline with the data it reads.
In pipeline parallelism, the DataStage tool works like a continuous process where the transformation, cleaning, and load processes keep working in parallel. An upstream process does not have to finish before the downstream process can start working, and vice versa.
This approach helps in minimizing disk usage for the staging area. Additionally, it also reduces the idle time of the processors involved.
Partition Parallelism
Suppose you have the same simple pipelined parallelism job, processing a very large amount of data.
In partition parallelism, the input data stream is divided into subsets referred to as partitions, each of which is processed by an individual processor running the same operation.
There are 4 tiers – the client tier, the services tier, the engine tier, and the repository tier.
The splitsubrec restructure operator splits the input subrecords into a set of top-level output vector fields.
The splitvect restructure operator promotes the elements of a fixed-length vector to a set of top-level fields with the same name.
One of the most frequently posed scenario based DataStage interview questions and answers, be ready for this conceptual question.
First, we look at the file and count how many records it has using wc -l filename.
Then we check whether the file has a header and a footer; to inspect or remove them, we can use the following Unix commands.
Suppose, for example, wc -l filename reports 1000 records.
We could use the below command using [head] to display the first line of a file.
$> head -1 file.txt
When we specify [head -2] then it will print only first 2 records of the file.
Otherwise, we could use the [sed] command. [sed] is a very powerful stream editor that is used for various text-manipulation purposes such as this.
$> sed '2,$ d' file.txt
Note that in [sed], the switch '$' always refers to the last line. So we could use the below command to delete the last line (the trailer) from the file in place.
$> sed -i '$ d' file.txt
Firstly, we use the following script using the [tail] command to get output.
$> tail -1 file.txt
If we use the [sed] command, then we use the following command. Remember, ‘$’ reads data from the last line.
$> sed -n '$ p' test
In [sed], by using the 'd' switch, we can delete a certain line from the output. So if we want to delete the first line, we use the following command -
$> sed '1 d' file.txt
This just prints all the lines except the first line of the file; hence, if we want to delete the first line from the file itself, we need to use one of two options.
Either we redirect the output to another file and then rename it back to the original file, as below:
$> sed '1 d' file.txt > new_file.txt
$> mv new_file.txt file.txt
Or else, we could use the [sed] switch '-i', which changes the file in place, as per the below command.
$> sed -i '1 d' file.txt
If we want to delete a range of lines from a file, we can do it in a similar way. For example:
$> sed -i '5,7 d' file.txt
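Putting these together, if the file has both a one-line header and a one-line trailer, both can be stripped in a single pass (file.txt is the same sample file used above; with GNU sed, adding -i edits the file in place instead of redirecting):
$> sed '1d;$d' file.txt > data_only.txt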
This is one of the most frequently asked DataStage interview questions for freshers in recent times.
This answer is specifically a part of DataStage performance tuning interview questions. For performance tuning of a DataStage job, the following practices must be used -
Additionally, select the correct configuration files, partitions, and buffers; handle the sorting of data and null/zero-time values carefully; use Copy, Modify, or Filter stages instead of Transformers where possible; and reduce unnecessary propagation of metadata between stages.
A repository table is used to access and process the output of the ad-hoc, historical, analytical or complex query results. It could be centralized or distributed.
Prior to DataStage version 8, repository information was stored in UniVerse (UV) files. Starting with version 8, repository information is stored in a database, which makes it easy to run queries and find details. During installation, you have the option to install the repository on either DB2 or Oracle.
DataStage repositories are typically created on DB2 databases because DB2 comes bundled with the Information Server software.
To set up DataStage, we require the following on the server -
This, along with other interview questions on DataStage, is a regular feature in DataStage interviews, be ready to tackle it with an approach mentioned below.
Following are the differences -
If an unknown error occurs during Job Sequencer execution, all stages after the exception activity will be executed. Exception activity in DataStage is, therefore, very important in case of such a situation.
For running the job in DataStage using the command line, we use the following command where we replace the specified project Name and job name –
dsjob -run -jobstatus <projectname> <jobname>
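A slightly fuller, hedged invocation is sketched below; the project name, job name, and parameter are hypothetical placeholders, and the exact set of options depends on your DataStage version (running dsjob with no arguments lists them). With -jobstatus, dsjob waits for the job to finish and returns an exit code reflecting the final job status, which a shell script or external scheduler can test.
dsjob -run -mode NORMAL -param RUN_DATE=2024-01-01 -warn 50 -jobstatus dstage_proj load_sales
echo $?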
The main difference between hashed and sequential files is that hashed files store data using a hashing algorithm and hash key values, while sequential files do not use key values to store data.
Based on this hash key function, searching hash files is faster than sequential files.
There are two types of hashed files: static hashed files and dynamic hashed files.
NLS stands for National Language Support. This means that the DataStage tools support, and can be used in, multiple languages, including multi-byte character languages such as Chinese and Japanese. You can read and write in any supported language.
The tagbatch restructure operator transforms tagged fields into an output record whose schema supports all possible fields of the tag instances.
The tagswitch operator converts the contents of tagged aggregates into InfoSphere DataStage-compatible records.
A collection library is a set of operators used to collect partitioned data.
The following collector types are available in the Collection Library.
A common yet one of the most important DataStage interview questions for experienced, don't miss this one.
InfoSphere DataStage automatically buffers links for certain stages. Its main purpose is to prevent deadlock conditions (where a stage cannot read its input because a previous stage in the job cannot write to its output).
Jobs with fork joins can lead to deadlock conditions. Here, a stage has two output links whose data paths are joined together later in the job. A situation can arise where all stages in the flow are waiting for other stages to read or write, so none of them can continue. No error or warning message is issued if a deadlock occurs; the job simply waits forever for input.
InfoSphere DataStage automatically inserts buffering into job flows containing fork joins that could lead to deadlock conditions. In most cases, you do not need to change the default buffering implemented by InfoSphere DataStage.
The DataStage players are the workhorse processes. They carry out the parallel processing and are assigned to operators on each node.
Player: One or more logical groupings of processes used to execute data flow logic. All players are created as a group on the same server as the section leader process.
Job Design :
To achieve the solution, we design the job as follows. Firstly, we read the sequential file as input; the data is then passed through a Sort stage and then a Transformer stage to produce the required output file.
b) Sort Stage Properties
We need to configure the Sort stage to sort the data based on the column “Sq” in descending order.
c) Transformer Stage Properties
Here, we define 2 stage variables, StageVar and StageVar1, with the following derivations -
d) Configuring Output File
Before capturing the data into a Sequential file, we need to sort the data again in ascending order to get the output as needed.
How do we convert a single row into multiple rows?
Sample Input file-
Col1 | Col2 | NameA | NameB | NameC |
---|---|---|---|---|
Abc | Def | Ram | Ben | Sed |
Sample O/P file data -
Col1 | Col2 | Name |
---|---|---|
Abc | Def | Ram |
Abc | Def | Ben |
Abc | Def | Sed |
For designing this process, we will use the Transformer stage to flatten the input data and create multiple output rows for each input row.
In the transformer stage –
Configuring the Loop condition –
@ITERATION <= 3
As each input row has three columns of names, we need to process each input row three times so that it will create three separate output rows.
Set the following loop variable –
We set LoopVar1 as follows:
IF (@ITERATION = 1) THEN inlink.Name1
ELSE IF (@ITERATION = 2) THEN inlink.Name2
ELSE inlink.Name3
Defining the Output link metadata and derivations
Col1 - inlink.col1
Col2 - inlink.col2
Name - LoopVar1
We need to design the following job to achieve the task. Here we read a sequential file as input, then the data is passed through a Sort and a Transformer stage to produce the requested output.
In sort Stage –
We need to sort the data based on column “Char” in ascending order.
In transformer stage –
We define the stage variables StageVar, StageVar1, and StageVar2
With the following derivations -
StageVar = If StageVar1 = DSLink6.Char Then StageVar + 1 Else 1
StageVar1 = DSLink6.Char
Then, we create a new column in the output which holds the occurrence count of each character and assign StageVar to it.
Occurrence = StageVar
Then, using the in-line sorting, we sort the data on "Occurrence" column in ascending order and populate the output as expected.
Usually, we take the max value from a column of a database table, but here we are asked to get the max value from a sequential file.
To design such a job, we first read the data from the flat file, then generate a dummy column in a Column Generator stage, then aggregate on that column, and finally send the output to a sequential file.
In the column generator stage, we generate the column DUMMY with the value 'X' for all the rows which we are reading from the input seq file. Additionally, we will set the generator for the Cycle algorithm with value 'X' and map the Salary and DUMMY columns in flow to the output of the Column Generator stage.
At the Aggregator stage, we then group on the DUMMY column and calculate the MAX of Salary. In the next step, we map the Max Salary to the output.
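For a quick cross-check outside DataStage, the same maximum can be computed on Unix, assuming a comma-delimited file with the salary in the third column (both the file name and the column position are illustrative assumptions):
$> cut -d',' -f3 salary_file.txt | sort -n | tail -1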
We read the sequential file, process the data in a Transformer stage, and then write the output to a file as asked. The job design would be like –
Input data is read from the sequential file stage:
Next, we pass the input data to a Transformer stage, where we separate the alphabets and the numerics using 2 stage variables.
We need to define the stage variables like below:
svAlpha: Convert("0123456789","",lnk_alphanumerics.MixData)
svNumerics: Convert(svAlpha,"",lnk_alphanumerics.MixData)
Next, we assign these stage variables to the output columns as follows.
Alpha = svAlpha
Number = svNumerics
In the last stage, we write the data into the output file.
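The same separation can be illustrated outside DataStage with the Unix tr command, which mirrors what the two Convert() derivations do (the sample value is just an assumption):
$> echo "abc123def45" | tr -d '0-9'
$> echo "abc123def45" | tr -cd '0-9'
The first command prints only the alphabets (abcdef) and the second prints only the numerics (12345).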
The JSP Expression Language (EL) is a simple language that enables you to access object methods with arguments (this is supported in EL 2.2 and later). To access a method with arguments in a JSP page, you specify the name of the object, followed by the method name and the arguments in parentheses.
For example, consider the following object with a format method that takes two arguments:
public class Formatter {
    public String format(String s, int i) {
        return s + ": " + i;
    }
}
To access the format method and pass it two arguments in a JSP page, you can use the following expression:
${formatter.format("Number", 42)}
The expression will return the string "Number: 42", which is the result of the format method.
You can also pass EL expressions as arguments to object methods. For example, consider the following object with a format method that takes a variable number of arguments:
public class Formatter {
    public String format(Object... args) {
        return Arrays.toString(args);
    }
}
To access the format method and pass it a variable number of arguments in a JSP page, you can use the following expression:
${formatter.format(1, 2, 3, 4, 5)}
The expression will return the string "[1, 2, 3, 4, 5]", which is the result of the format method.
You can also pass EL expressions directly as arguments to object methods:
${formatter.format(1 + 1, 2 + 2, 3 + 3)}
In this example, the expression will pass the values 2, 4, and 6 as arguments to the format method.
You can also use the fn:length function to pass the length of an array or a collection as an argument to a method:
${formatter.format(fn:length(numbers))}
In this example, the expression will pass the length of the numbers array as an argument to the format method.
To summarize, the JSP EL makes it easy to access and invoke object methods with arguments in a JSP page. You can use it to pass arguments to methods, as well as to pass EL expressions and the length of arrays and collections as arguments.
DataStage is a tool provided by IBM that is used to design, develop, and run applications that populate large volumes of data into data warehouses and datamarts by extracting data from diverse databases on Windows servers. It includes graphical visualizations for data integration and can also extract data from multiple sources and run scheduled jobs. As such, it is considered one of the most powerful ETL tools. DataStage has different versions that companies can use depending on their needs. The versions are Server Edition, MVS Edition, and Enterprise Edition. IBM DataStage tool was the first tool that introduced the parallelism concept.
Some of the notable capabilities of the DataStage tool are -
InfoSphere DataStage and QualityStage can access data in enterprise applications and data sources such as:
The aspects of IBM InfoSphere DataStage are -
ETL tool is required by the Data Engineers for configuring the data from one source to another. DataStage is a tool in the IBM InfoSphere Information Server Suite where we can create data pipelines from a source system to the destination system by designing the internal data transformations such as merging, joining, lookups, and clustering data to create meaningful data.
The data is required to populate the different data tables in a data warehouse, datamarts Staging tables where other systems could extract the data for data analysis, data science etc.
The system allows the creation of the scheduled job along with featuring parallel processing, where it shows performance exceptions from other such ETL tools.
The DataStage tool being the ETL (Extract, Transform, Load), suggests that there would be a place where the tool would extract the data from the source system (such as mainframe databases, OLAP databases, file system, Database tables, SQL Query in Oracle). There could be a minimum of one such source system where we would be getting the data, and after the transformation, the data prepared is populated into the multiple destination systems (database tables, file output, analytics tools, etc.), which is named as a destination in the tool.
The DataStage tool also supports Big data Hadoop by using tools such as an H-Base connector, JSON support, file system etc.
DataStage job is a code flow prepared by the developer, which is arranged in a process and stages defining the final data and requirement as asked.
In each stage, the task is broken into specific functionalities and is clubbed together in the tool and scheduled, which is named as DataStage job.
For example –
Suppose we have an objective to sum the sales amount. So, breaking it into stages, firstly, we would be performing the operation of the ‘group by’ function on the Sales data.
Nextly, in the next step, we would load that data into the destination table, so it would be designed in the next stage. Then using the DataStage links, we make the flow move from the ‘group by’ stage to the target loading stage.
dsjob command is used to get and display information about a specific project, job, phase, or link. Access to log files. You can use the dsjob command to add entries to a job's log file or to retrieve and display specific log entries.
Command | Functionality |
---|---|
$dsjob -run | Used to run the DataStage job |
$dsjob -stop | used to stop the job that is currently present in the process |
$dsjob -jobid | used for providing the job information |
$dsjob -report | used for displaying the complete job report |
$dsjob -lprojects | used for listing all the projects that are present |
$dsjob -ljobs | used for listing all the jobs that are present in the project |
$dsjob -lstages | used for listing all the stages of the current job |
$dsjob -llinks | used for listing all the links |
$dsjobs -lparams | used for listing all the parameters of the job |
$dsjob -projectinfo | used for retrieving information about the project |
$dsjob -log | used for adding a text message to the log |
$dsjob -jobinfo | used for the information retrieval of the job |
$dsjob -stageinfo | used for information retrieval at that stage of the job |
$dsjob –loginfo | used for getting information about the log |
A must-know for anyone looking for DataStage advanced interview questions, this is one of the frequent questions asked of managers as well.
IBM DataStage flow designer is a window client application where we can create, edit, schedule, trigger and perform operations. It provides the dashboard palette where we can drag/ drop components to create jobs and go to a few important features of the DataStage flow designer listed below -
We could trigger the code part through the designer tasks and link the different stages through DataStage links. The tool provides options such as using containers, creating, compiling, running jobs, editing, scheduling the existing jobs, Importing/ exporting jobs, loading columns, maintaining metadata of job info, updating configuration files, parameters etc. functionality which makes the job execution dynamic.
A tier is a logical grouping of components within the tool and the computers on which those components are installed.
Every tier contains a subset of the components that make up InfoSphere Information Server.
Tiers provide product module services, order fulfillment, and storage of metadata and other data.
There are mainly 6 service tiers-
Client Tier
It consists of client programs and consoles used for development, administration, and other tasks and the computers on which those programs and consoles are installed.
Services Tier
The Services Tier consists of Application Servers, General Services, and product services for suites and product modules. The services layer provides general services (such as security) and services specific to a particular product module. At the service level, IBM WebSphere Application Server hosts the service. It also hosts web-based InfoSphere Information Server applications.
Engine Level
The engine level is a logical grouping of engine components and communication agents, which helps in executing jobs and other tasks for product modules.
Repository Tier
The repository tier consists of the metadata repository and, if installed, other data stores to support other product modules. The metadata repository contains shared metadata, data, and configuration information for InfoSphere Information Server product modules. Other data stores contain extension data for use by supporting product modules. The operational database is the data store used by the engine operations console.
Microservices Layer
The Microservices Layer consists of many different microservices, each with a different role.
Layer Relationships
This layer provides storage for metadata and other data for services, job executions, and product modules that we install.
The common services in DataStage are mainly –
For the complex architecture, subsequent jobs could be created to run multiple jobs in combination with other jobs, which are called DataStage sequences. Sequence jobs allow for incorporating programming controls such as branching and looping into the flow.
In DataStage, a special type of job called a sequence job is present where a series of parallel or server jobs could be specified to run.
Different actions could be taken depending on whether the jobs in the sequence succeed or fail. After creating a successor job, we schedule it to run using the InfoSphere DataStage Director client, just like any parallel or server job.
The Merge stage joins tables based on the values of key columns as specified in the table definition.
The merge could have any number of input links and only a single output link, and there must be the same number of reject links as there are updated input links.
The Merge stage combines a master data set with the other data sets provided in its processing stage. It is one of three processes that join tables based on the values of the primary key of the table. The other two stages are:
The data files are the raw or exported data that is to be imported into the system and contain the information which is required by the user, whereas the descriptor file contains all the metadata of the data files, i.e., it contains the information and description about the data into the file provided.
Often, we require the descriptor file where we need the information at a brief as dealing with the large scale of data doesn’t provide us the flexibility to go through each data delivered by the source system.
Expect to come across this important DataStage question in your next interviews.
DataStage has three stages of processing that allow you to join tables based on key column values: lookup, join and merge. This post explains which phase to choose when, the differences between these phases, and development references for using these phases.
We use lookup levels when:
We use the join phase when:
We use merge levels when:
A data pipeline is a process of importing records from a source system and moving them through a set of processing functions (transformations) defined in a data flow (job). After the records pass through the pipeline, they can be processed without being written to a table
The requirement of data Pipelining is to -
Data partitioning is an approach to parallelism that divides a data set into partitions or subsets of the data set. In the absence of resource constraints or other data skew issues, data partitioning linearly improves application performance.
DataStage provides a capability where it automatically partitions data based on the partition type required by the phase.
Traditional packaging tools do not have this capability, so developers used to manually create data partitions and rewrite applications or data partitions scripts, which was costly and time-consuming when administrators wanted to use more hardware capacity.
The sort function can be used for removing duplicates in DataStage. While running the sort function, the user should specify the option that allows for duplicates and set it to false.
Steps to remove the duplicates are -
To access DataStage, we need to download and install the latest version of the IBM InfoSphere Server present on the web. The server supports AIX, Linux, and Windows operating systems and could be chosen as per the requirement.
We could migrate the data from an older version of infosphere to a new version by the tool - asset interchange tool,
Required Installation Files
For installing and configuring Infosphere DataStage, we need the following files in the local system.
Windows server,
EtlDeploymentPackage-windows-oracle.pkg EtlDeploymentPackage-windows-db2.pkg
Linux server,
EtlDeploymentPackage-linux-db2.pkg EtlDeploymentPackage-linux-oracle.pkg
IBM DataStage has 4 components – Administrator for managing Admin tasks, Manager for storing reusable Metadata, Designer for managing applications/ jobs, and Director for validating, scheduling, executing and monitoring DataStage server jobs and parallel jobs.
DataStage follows a client-server model as its architecture, and the main components of the client-server architecture are:
For setting up the projects, we start by creating the DataStage job on the DataStage server, and then all objects that we create in the meantime are stored in the DataStage project. The DataStage project serves as a separate environment on the server for the storage and maintenance of jobs, tables, definitions, and routines.
A DataStage project represents a separate environment on the installed server containing the jobs, tables, definitions, and routines created for the task.
The “Palette” window contains all the stages which are available in the DataStage tool. Based on the functionalities, multiple stages are provided from which the developer chooses.
Listing down the various categories of stages present in the Palette are - General, Data Quality, Database, Development, File, Processing, etc.
Different DataStage Processing Stages in the tool are the functionalities that help in the following:
A few examples of processing Stages are –
Transformer, Aggregator, FTP, Link Collector, Interprocess, Pivot, Sort etc., which helps in dealing with the data transformation and processing effectively.
A staple in DataStage interview questions and answers for experienced, be prepared to answer this one using your hands-on experience.
DataStage offers two sorting methods for parallel jobs.
By default, both methods use the same sort package (DataStage sort operator). Independent sorting offers more possibilities.
Link Sort
When the key splitting method is specified, this option cannot be used when sorting by the link option is specified in the input/split level option, and the automatic splitting method is used.
Key column usage:
Sorting phase
The standalone sorting phase has additional options that are not available for link sorting.
The following properties are not available when sorting by link:
We could use command-line functions to import/ export the DataStage job
Additionally, The XML format serves as the basis for moving jobs/dataflows between the Design Studio and DataStage environments. We could export a DataStage job to an XML file and then import it into the Design Studio to migrate the job from one server to another. The imported job is stored as a subflow in the designer.
A routine is a set of functions defined by DataStage Manager, which is routed through a transformer stage.
There are three types of routines:
Routines are stored in the Routines branch of the DataStage repository, where they can be created, viewed, or edited, which could be done through the Routine Dialog Box.
The system variables are the read-only variables that are used in the transformer stage or routine. They supply the system information, and these variables start with ‘@.’
Few system variables and their uses –
@FALSE | It replaces the value with 0. |
@TRUE | It replaces the value with 1. |
@INROWNUM | Number of input rows |
@OUTROWNUM | Number of output rows per link |
@NUMPARTITIONS | Total number of partitions for the stage. |
@PARTITIONNUM | Partition number for the particular instance. |
@ITERATION | Iteration number of the current loop |
A container is a group of stages and links. Containers allow you to simplify and modularize your job design by replacing complex areas of your diagram with single container stages.
These containers are available for both - parallel and server jobs.
IBM DataStage tool offers two types of containers.
Local container. They are created within a job and can only be accessed within that job. Local containers are edited on tabs in the job's graph window. Local containers can be used in server jobs or parallel jobs. Their primary use is to "clean up" the job design.
Shared containers. They are created separately and stored in repository-like jobs. These can be inserted into your job design.
The DataStage director tool is used for:
The DataStage Director window is split into two panes.
The Job Categories pane shows the ones on the repository.
The right pane shows one of his three views: Status View, Schedule View, or Log View.
DataStage Director has three view options.
InfoSphere DataStage supports different data types from Java and Hive data types. To complete the operation, the connector needs to map data types depending on the scenario in which the connector will be used. You can configure the Hive Connector stage to automatically add columns to the output link at runtime.
Hive Connector is a tool used to support partition mode when reading data. There are two ways to do this.
The DataStage tool provides support to connect to the HBase dataset and access their database components and tables in the DataStage. It is required to perform the following tasks:
Don't be surprised if this question pops up as one of the top DataStage technical interview questions in your next interview.
There are two basic types of parallelism. Pipelining and partitioning. The DataStage tool provides these two methods. For example, for a simple parallel job that extracts data from one data source, transforms it in some way, and then writes it to another data source. In any case, this job looks the same on the designer's canvas but can be configured to work in different ways depending upon the parallelism selected.
Pipeline parallelism
If you run a sample of his job on a system with 3 or more processors, the phase read he will start on one processor and start filling the pipeline with read data.
In parallel parallelism, the Datastage tool works like a continuous process where the process of transformation, cleaning and load process keeps working in parallel. It is not required that one process of upstream will stop while the downstream process is working and vice-versa.
This process helps in minimizing the risk usage, which is for the staging area. Additionally, it also reduces the idle time held on the processors working.
Partition Parallelism
Suppose you have the same simple pipelined parallelism job, processing a very large amount of data.
In partition parallelism, the input data stream is divided into various subsets referred to as partitions which are further processed using the individual processors by the same operation process.
There are 4 tiers –
splitsubrec restructure operator splits the input subrecords into a set of top-level output vector fields
splitvect restructure operator promotes the elements of a fixed-length vector to a set of top-level fields with the same name.
One of the most frequently posed scenario based DataStage interview questions and answers, be ready for this conceptual question.
First we look at the file name and how many Is there a record with wc - l filename is counted
Then first check if the file name has a header and footer, If unavailable, then use following Unix commands
when we use wc -l filename, it will give count, suppose for example there are 1000 records
We could use the below command using [head] to display the first line of a file.
$> head -1 file.txt
When we specify [head -2] then it will print only first 2 records of the file.
Otherwise, we could use [sed] command. [Sed] is a very powerful text editor which is used for various text manipulation purposes as asked here.
$> sed '2,$ d' file.txt
Note that in [sed], switch '$' refers to the last line always. So we could use the below command as it will read data from the last line and give output as expected.
$> sed –i '$ d' file.txt
Firstly, we use the following script using the [tail] command to get output.
$> tail -1 file.txt
If we use the [sed] command, then we use the following command. Remember, ‘$’ reads data from the last line.
$> sed -n '$ p' test
In [sed], by using the 'd' switch, we could delete a certain line from the output –. So if we want to delete the first line, the following command we use-
$> sed '1 d' file.txt
Here it will just print out all the lines except the first line in the file, and hence if we want to delete the first line from the file itself, we need to perform either of two options.
Either we can redirect the output of the file to another file and then renaming it back to original file as below:
$> sed '1 d' file.txt > new_file.txt $> mv new_file.txt file.txt
Or else, we could use [sed] switch '–i', which changes the file in place as per the below command.
$> sed –i '1 d' file.txt
If you want to delete line by line from a specific file, you can perform your task in a similar way as above. For example:
$> sed –i '5,7 d' file.txt
This is one of the most frequently asked DataStage interview questions for freshers in recent times.
This answer is specifically a part of DataStage performance tuning interview questions. For performance tuning of the DataStage job, the following practice must be used -
Additionally, selecting the correct configuration files, partitions and buffers. Handling sorting the data and handling zero-time values. You should try to copy, modify, or filter instead of using transformers. Need to reduce unnecessary metadata proliferation between phases.
A repository table is used to access and process the output of the ad-hoc, historical, analytical or complex query results. It could be centralized or distributed.
Prior to DataStage version 8, DataStage repository information was stored in files(uv). Starting with version 8, repository information is also stored in the database. The information is available in a database, making it easy to create queries and find details. During installation, you have the option to install the repository on either DB2 or Oracle.
DataStage repositories are typically created on DB2 databases. This is because Db2 comes standard with Information Server software.
For DataStage, we require the following in the server to set up.
This, along with other interview questions on DataStage, is a regular feature in DataStage interviews, be ready to tackle it with an approach mentioned below.
Following are the differences -
If an unknown error occurs during Job Sequencer execution, all stages after the exception activity will be executed. Exception activity in DataStage is, therefore, very important in case of such a situation.
For running the job in DataStage using the command line, we use the following command where we replace the specified project Name and job name –
dsjob -run -jobstatus <projectname> <jobname>
The only difference between hashed and sequential files is that hashed files store data about the hashing algorithm and hash key values, while sequential files do not have key values to store data.
Based on this hash key function, searching hash files is faster than sequential files.
There are two types of hash files:
NLS stands for National Language Support. This means that the DataStage tools support and can be used in multiple languages, including multi-byte character languages (Chinese, Japanese, Spanish). You can read and write in any language.
tagbatch restructure operator transforms the tagged fields into an output record whose schema supports all possible fields of a tag instance.
Tagswitch is used for the contents of tagged aggregates to be converted to InfoSphere DataStage-compatible records.-compatible records.
A collection library is a set of operators used to collect partitioned data.
The following collector types are available in the Collection Library.
A common yet one of the most important DataStage interview questions for experienced, don't miss this one.
InfoSphere DataStage automatically buffers links for certain stages. Its main purpose is to prevent deadlock conditions (where a stage cannot read its input because a previous stage in the job cannot write to its output).
Jobs with fork joins can lead to deadlock conditions. Here, the stage has two output connections whose data paths are merged later in the job. A situation can arise where all stages in the flow are waiting for other stages to read or write, so none of them can continue. No error or warning message is issued if a deadlock occurs. Your work will wait forever for input.
InfoSphere DataStage automatically inserts buffering into job flows containing branch joins that can lead to deadlock conditions. In most cases, you do not need to change the default buffering implemented by InfoSphere DataStage.
The DataStage player is the workhorse process. They are useful for parallel processing and are assigned to operators on each node.
Player: One or more logical groupings of processes used to execute data flow logic. All players are created as a group on the same server as the section leader process.
Job Design :
To achieve the solution, we design the job as shown in the below image. Firstly, we read the seq file as input, then data is passed through a Sort and then Transformer stage to achieve the output as needed in the file.
b) Sort Stage Properties
We would require to configure the sort stage where we sort the data based on column “Sq” in descending order.
c) Transformer Stage Properties
Here, we define 2 stage variables: StageVar, StageVar1 and following derivations -
d) Configuring Output File
Before capturing the data into a Sequential file, we need to sort the data again in ascending order to get the output as needed.
For designing this process, we will use the Transformer stage to flatten the input data and create multiple output rows for each input row.
In the transformer stage –
Configuring the Loop condition –
@ITERATION <= 3
As each input row has three columns of names, we need to process each input row three times so that it will create three separate output rows.
Set the following loop variable –
LoopVar1 we set by the following:
IF (@ITERATION = 1) THEN inlink.Name1
ELSE IF (@ITERATION = 2) THEN inlink.Name2
ELSE inlink.Name3
Defining the Output link metadata and derivations
Col1 - inlink.col1
Col2 - inlink.col2
Name - LoopVar1
We need to design the following job for achieving the task. Here we are reading sequential file as a input, then data is passing through a Sort and Transformer stage to achieve the asked.
In sort Stage –
We need to sort the data based on column “Char” in ascending order.
In transformer stage –
We define 2 stage variable : StageVar, StageVar1, StageVar2
With following derivations -
StageVar = If StageVar1=DSLink6.Char Then StageVar+1 Else1
StageVar1 = DSLink6.Char
Then, create a new column in output which will hold the value of Occurrence of characters and assigned the StageVar.
Occurrence = StageVar
Then, using the in-line sorting, we sort the data on "Occurrence" column in ascending order and populate the output as expected.
Usually, we use max value in a column of a table, but here asked is to get the max value from seq file.
For designing such job, we firstly read the data from flat file, following it we generate a dummy column in column generator stage, next, doing an aggregate on that and sending the output to seq file.
In the column generator stage, we generate the column DUMMY with the value 'X' for all the rows which we are reading from the input seq file. Additionally, we will set the generator for the Cycle algorithm with value 'X' and map the Salary and DUMMY columns in flow to the output of the Column Generator stage.
At the aggregator stage, we will then aggregate on the DUMMY column and then calculate the MAX of Salary. In the nextstep, we map the Max Salary to Output.
We read the sequential file, then do processing in transformer stage and then write the output to file as asked. The job design would be like –
Input data is read from the sequential file stage:
Next, we pass the input data to the Transformer stage, where we separate the alphabetic and numeric characters using two stage variables.
We define the stage variables as below:
svAlpha: Convert("0123456789","",lnk_alphanumerics.MixData)
svNumerics: Convert(svAlpha,"",lnk_alphanumerics.MixData)
Next, we assign these stage variables to the output columns as follows:
Alpha = svAlpha
Number = svNumerics
In the last stage, we write the data to the output file.
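To show what the two Convert() derivations do, here is an illustrative Python equivalent (not DataStage syntax): the first expression strips the digits out of the mixed string, and the second strips out everything that survived the first step, leaving only the digits.
# Illustrative Python equivalent of the two Convert() stage variables above.
def split_alpha_numeric(mix_data):
    digits = "0123456789"
    # svAlpha: Convert("0123456789", "", MixData) -> remove digits, keep alphabets
    sv_alpha = "".join(ch for ch in mix_data if ch not in digits)
    # svNumerics: Convert(svAlpha, "", MixData) -> remove those characters, keep digits
    sv_numerics = "".join(ch for ch in mix_data if ch not in sv_alpha)
    return sv_alpha, sv_numerics

print(split_alpha_numeric("AB12CD34"))  # ('ABCD', '1234')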
A DataStage interview generally focuses on the ETL process and on the intermediate transformations you could add to achieve the desired result. Apart from the DataStage interview questions with answers listed above, a top tip for the interview is to go through the different tiers and tools in IBM InfoSphere and revise your knowledge of each. Visit our online Database courses to learn DataStage and other database tools with hands-on projects to make you job ready.
According to Credly, the top companies hiring DataStage developers in India are:
Data Engineers looking for a high-end ETL tool and working with the IBM InfoSphere Information Server Suite product can apply to these roles -
While reading the important questions, we need to keep the focus on the core concepts and on the transformation tools and capabilities that the DataStage tool provides.
For Data Engineers role -
For Admin roles -
You need to learn the directory system, job tools, logs, repository, etc., where they are stored, and how they should be configured.
A few common DataStage questions one should expect and always be prepared for in such kinds of interviews are –
Advice for beginners would be to understand the process, be clear about your previous role, and answer questions about the process and structure as in the questions above rather than mentioning the tool everywhere. You can always go through the above-mentioned DataStage scenario-based questions with answers.
The article provides the latest DataStage interview questions and answers, which will be a sufficient resource for clearing the interview. The DataStage developer interview questions mentioned in this article are among the most frequently asked, so it is recommended to get a good hold of these as well.
Also, you could visit the other online Database courses on KnowledgeHut for more hands-on learning.
The IBM InfoSphere DataStage tool is an ETL tool used in data engineering applications. This article focuses on providing a gist and overview of the DataStage interview questions that could be asked during an interview.
The article covers aspects such as the architecture, capabilities, functions, and applications provided by the tool, and it is advisable to go through the structure of the tool if you are a fresher appearing for an interview, so that you properly understand what you are answering about. Working engineers should go through the DataStage partitioning interview questions and the DataStage production support interview questions, as these cover major roles in the IT sector.
ETL tools generally work on the common principle of extracting data from one or more sources, applying internal transformations to join, filter, update, delete, assemble, merge, and save the data in between, and finally storing the data in the destination system as tables, reports, or any other application feed.
IBM provides a suite comparable to those of other vendors; for example, Microsoft provides the MSBI tools (SQL Server Integration Services, Reporting Services, and Analysis Services), with which we could build similar kinds of applications.
The IBM suite can store and manage data and ease the client's work, and it includes separate applications for managing jobs, access, reports, metadata, etc., whose details can be found in the sections above.
To summarize, IBM DataStage is a powerful tool for designing, developing, and running applications that extract data from databases and populate data warehouses. Below are the four main components of DataStage. The Administrator is used for administrative tasks, including setting up DataStage users and purging criteria and creating and deleting projects. The Designer is the design interface used to develop DataStage applications or jobs, which are scheduled by the Director and executed by the server. The Manager, as the name suggests, manages the repository and allows users to view and edit the metadata stored in it. The Director performs functions such as validating, scheduling, executing, and monitoring jobs. DataStage also supports big data, which can be accessed in several ways, including JDBC integration, JSON support, and the Distributed File System.
The interview questions are prepared and classified based on the level of expertise one holds, so the reader can gain sufficient knowledge before facing the interviewer. There are many such articles about the best Database courses to take up, as data jobs are plentiful in the market.