APTRON publishes this list of the best Hadoop interview questions and answers asked in interview sessions conducted at various MNCs. APTRON, a leading Hadoop training institute, has earned its reputation by providing 100% placement assistance to its students. During Hadoop training and certification, the trainers impart practical skills and develop decision-making scenarios in the lab to give students first-hand Hadoop experience.
Our trainers, each with 10+ years of experience, work on the overall development of students by conducting mock-interview sessions after the Hadoop course. The training sessions also cover personality development, spoken English, and presentation skills to boost students' confidence. Such Hadoop training and coaching helps students secure a job in an MNC quickly.
APTRON is one of the most credible Hadoop training institutes, offering hands-on practical knowledge and full job assistance with both basic and advanced Hadoop courses. At APTRON, Hadoop training is conducted by subject-specialist corporate professionals with 10+ years of experience managing real-time projects.
Here is the list of Hadoop interview questions and answers from those sessions:
Hadoop Interview Questions | Hadoop Interview Answers |
For each YARN job, the Hadoop framework generates task log files. Where are Hadoop task log files stored? | On the local disk of the slave node running the task. |
You want a node to swap Hadoop daemon data from RAM to disk only when absolutely necessary. What should you do? | Set the vm.swappiness parameter to 0 on the node (e.g., in /etc/sysctl.conf). |
Your cluster is running MapReduce version 2 (MRv2) on YARN. Your ResourceManager is configured to use the FairScheduler. Now you want to configure your scheduler such that a new user on the cluster can submit jobs into their own queue at application submission. Which configuration should you set? | You can specify a new queue name when the user submits a job, and the new queue can be created dynamically if the property yarn.scheduler.fair.allow-undeclared-pools = true. |
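As a sketch, the setting above might appear in yarn-site.xml as follows (assuming the FairScheduler is the configured scheduler class; the values shown are illustrative):

```xml
<!-- yarn-site.xml: use the FairScheduler and allow queues (pools)
     to be created on the fly when users submit into them -->
<property>
  <name>yarn.resourcemanager.scheduler.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
</property>
<property>
  <name>yarn.scheduler.fair.allow-undeclared-pools</name>
  <value>true</value>
</property>
```

Related: yarn.scheduler.fair.user-as-default-queue (default true) routes each user's jobs into a queue named after the user when no queue is specified.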
You observed that the number of spilled records from Map tasks far exceeds the number of map output records. Your child heap size is 1GB and your io.sort.mb value is set to 1000MB. How would you tune your io.sort.mb value to achieve maximum memory to disk I/O ratio? | Tune the io.sort.mb value until you observe that the number of spilled records equals (or is as close to equals) the number of map output records. |
You are running a Hadoop cluster with a NameNode on host mynamenode, a secondary NameNode on host mysecondarynamenode and several DataNodes. Which best describes how you determine when the last checkpoint happened? | Connect to the web UI of the Secondary NameNode (http://mysecondarynamenode:50090/) and look at the “Last Checkpoint” information. |
You want to understand more about how users browse your public website. For example, you want to know which pages they visit prior to placing an order. You have a server farm of 200 web servers hosting your website. Which is the most efficient process to gather these web server logs into your Hadoop cluster for analysis? | Ingest the web server logs into HDFS using Flume. |
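A minimal sketch of such a Flume agent, tailing a web server access log into HDFS, might look like the following (the agent, source, channel, and sink names and all file paths are illustrative assumptions, not from the source):

```properties
# flume.properties (illustrative): tail an access log into HDFS
agent1.sources = weblog
agent1.channels = mem
agent1.sinks = hdfs-sink

# Source: follow the web server's access log
agent1.sources.weblog.type = exec
agent1.sources.weblog.command = tail -F /var/log/httpd/access_log
agent1.sources.weblog.channels = mem

# Channel: buffer events in memory
agent1.channels.mem.type = memory

# Sink: write events into date-partitioned HDFS directories
agent1.sinks.hdfs-sink.type = hdfs
agent1.sinks.hdfs-sink.hdfs.path = /user/flume/weblogs/%Y-%m-%d
agent1.sinks.hdfs-sink.hdfs.useLocalTimeStamp = true
agent1.sinks.hdfs-sink.channel = mem
```

In practice each of the 200 web servers would run such an agent, typically fanning in to a collector tier before writing to HDFS.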
On a cluster running CDH 5.0 or above, you use the hadoop fs -put command to write a 300MB file into a previously empty directory using an HDFS block size of 64 MB. Just after this command has finished writing 200 MB of this file, what would another user see when they look in the directory? | They will see the file with a ._COPYING_ extension on its name. If they view the file, they will see contents of the file up to the last completed block (as each 64MB block is written, that block becomes available). |
Table schemas in Hive are | Stored in the Metastore (a relational database), not along with the data in HDFS. |
You are configuring your cluster to run HDFS and MapReduce v2 (MRv2) on YARN. Which two daemons need to be installed on your cluster’s master nodes? | ResourceManager, NameNode. |
Which YARN daemon or service monitors a container’s per-application resource usage (e.g., memory, CPU)? | ApplicationMaster. |
Which scheduler would you deploy to ensure that your cluster allows short jobs to finish within a reasonable time without starting long-running jobs? | Fair Scheduler. |
Your cluster is configured with HDFS and MapReduce version 2 (MRv2) on YARN. What is the result when you execute: hadoop jar SampleJar MyClass on a client machine? | SampleJar.jar is sent to the ApplicationMaster, which allocates a container for SampleJar.jar. |
You are working on a project where you need to chain together MapReduce, Pig jobs. You also need the ability to use forks, decision points, and path joins. Which ecosystem project should you use to perform these actions? | Oozie. |
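To illustrate the chaining above, here is a sketch of an Oozie workflow.xml that forks into a MapReduce action and a Pig action, joins, then takes a decision (all names, the Pig script, and the decision predicate are illustrative assumptions):

```xml
<!-- workflow.xml sketch: fork, parallel MR + Pig actions, join, decision -->
<workflow-app name="etl-wf" xmlns="uri:oozie:workflow:0.5">
  <start to="split"/>
  <fork name="split">
    <path start="mr-step"/>
    <path start="pig-step"/>
  </fork>
  <action name="mr-step">
    <map-reduce>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
    </map-reduce>
    <ok to="merge"/>
    <error to="fail"/>
  </action>
  <action name="pig-step">
    <pig>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <script>clean.pig</script>
    </pig>
    <ok to="merge"/>
    <error to="fail"/>
  </action>
  <join name="merge" to="check"/>
  <decision name="check">
    <switch>
      <case to="end">${wf:lastErrorNode() eq null}</case>
      <default to="fail"/>
    </switch>
  </decision>
  <kill name="fail"><message>Workflow failed</message></kill>
  <end name="end"/>
</workflow-app>
```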
Which process instantiates user code, and executes map and reduce tasks on a cluster running MapReduce v2 (MRv2) on YARN? | NodeManager. |
Which two features does Kerberos security add to a Hadoop cluster? | User authentication on all remote procedure calls (RPCs), and mutual authentication of Hadoop daemons (services) to one another. |
Assuming a cluster running HDFS and MapReduce version 2 (MRv2) on YARN with all settings at their default, what do you need to do when adding a new slave node to the cluster? | Nothing, other than ensuring that the DNS (or /etc/hosts files on all machines) contains an entry for the new node. |
Which YARN daemon or service negotiates map and reduce Containers from the Scheduler, tracking their status and monitoring progress? | ApplicationMaster. |
During the execution of a MapReduce v2 (MRv2) job on YARN, where does the Mapper place the intermediate data of each Map Task? | The Mapper stores the intermediate data on the underlying filesystem of the local disk, in the directories specified by yarn.nodemanager.local-dirs. |
You suspect that your NameNode is incorrectly configured, and is swapping memory to disk. Which Linux commands help you to identify whether swapping is occurring? | free, top, vmstat. |
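A short sketch of how those tools are typically used on the NameNode host (Linux); the awk one-liner reads the same figures directly from /proc/meminfo, which is handy in scripts:

```shell
# 'free -m' shows total vs. used swap; 'top' shows per-process residency;
# non-zero 'si'/'so' columns in 'vmstat 1' mean pages are being swapped
# in/out right now. The raw numbers also live in /proc/meminfo:
awk '/SwapTotal/ {t=$2} /SwapFree/ {f=$2} END {print t - f " kB swap in use"}' /proc/meminfo
```

If the value is persistently non-zero on a NameNode, lowering vm.swappiness or adding RAM is the usual remedy.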
Which command does Hadoop offer to discover missing or corrupt HDFS data? | hdfs fsck. |
You are planning a Hadoop cluster and considering implementing 10 Gigabit Ethernet as the network fabric. Which workloads benefit the most from faster network fabric? | When your workload generates a large amount of output data, significantly larger than the amount of intermediate data. |
A slave node in your cluster has four 2 TB hard drives installed (4 x 2 TB). The DataNode is configured to store HDFS blocks on all disks. You set the value of the dfs.datanode.du.reserved parameter to 100 GB. How does this alter HDFS block storage? | 100 GB of space on each disk volume is reserved for non-HDFS use; the remaining space on each drive may be used to store HDFS blocks. |
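As a sketch, the reservation above would be set in hdfs-site.xml like this (the value is in bytes and applies per volume):

```xml
<!-- hdfs-site.xml: reserve 100 GB per disk volume for non-HDFS use -->
<property>
  <name>dfs.datanode.du.reserved</name>
  <value>107374182400</value> <!-- 100 * 1024^3 bytes -->
</property>
```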
What two processes must you do if you are running a Hadoop cluster with a single NameNode and six DataNodes, and you want to change a configuration parameter so that it affects all six DataNodes? | You must modify the configuration files on each of the six DataNode machines (DataNodes do not read their configuration from the NameNode), and you must restart the DataNode daemon on each of the six nodes to apply the changes. |
You have installed a cluster running HDFS and MapReduce version 2 (MRv2) on YARN. You have no dfs.hosts entry(ies) in your hdfs-site.xml configuration file. You configure a new worker node by setting fs.default.name in its configuration files to point to the NameNode on your cluster, and you start the DataNode daemon on that worker node. What do you have to do on the cluster to allow the worker node to join, and start storing HDFS blocks? | Without creating a dfs.hosts file or making any entries, run the command hadoop dfsadmin -refreshNodes on the NameNode. |
You use the hadoop fs -put command to add a file “sales.txt” to HDFS. This file is small enough that it fits into a single block, which is replicated to three nodes in your cluster (with a replication factor of 3). One of the nodes holding this file (a single block) fails. How will the cluster handle the replication of the file in this situation? | The file will be re-replicated automatically after the NameNode determines it is under-replicated, based on the block reports it receives from the DataNodes. |
You are configuring a server running HDFS, MapReduce version 2 (MRv2) on YARN running Linux. How must you format underlying file system of each DataNode? | They must be formatted as either ext3 or ext4. |
You are migrating a cluster from MapReduce version 1 (MRv1) to MapReduce version 2 (MRv2) on YARN. You want to maintain your MRv1 TaskTracker slot capacities when you migrate. What should you do? | Configure yarn.nodemanager.resource.memory-mb and yarn.nodemanager.resource.cpu-vcores to match the capacity you require under YARN for each NodeManager. |
On a cluster running MapReduce v2 (MRv2) on YARN, a MapReduce job is given a directory of 10 plain text files as its input directory. Each file is made up of 3 HDFS blocks. How many Mappers will run? | 30 (with the default input format, one Mapper runs per HDFS block). |
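A sketch of the NodeManager capacity settings in yarn-site.xml (the 24 GB / 12 vcore values are illustrative; size them to match your former slot capacity):

```xml
<!-- yarn-site.xml: resources each NodeManager offers to YARN containers -->
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>24576</value> <!-- 24 GB -->
</property>
<property>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>12</value>
</property>
```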
You’re upgrading a Hadoop cluster from HDFS and MapReduce version 1 (MRv1) to one running HDFS and MapReduce version 2 (MRv2) on YARN. You want to set and enforce a block size of 128MB for all new files written to the cluster after the upgrade. What should you do? | Set dfs.blocksize to 128 MB on all the worker nodes and client machines, and mark the parameter as final. You do not need to set this value on the NameNode. |
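As a sketch, the enforced block size would look like this in hdfs-site.xml on the worker nodes and client machines (the <final> flag prevents job configurations from overriding it):

```xml
<!-- hdfs-site.xml on worker nodes and client machines -->
<property>
  <name>dfs.blocksize</name>
  <value>134217728</value> <!-- 128 MB -->
  <final>true</final>
</property>
```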
Your Hadoop cluster is configured with HDFS and MapReduce version 2 (MRv2) on YARN. Can you configure a worker node to run a NodeManager daemon but not a DataNode daemon and still have a functional cluster? | Yes. The daemon will get data from another (non-local) DataNode to run Map tasks. |
You have a 20-node Hadoop cluster, with 18 slave nodes and 2 master nodes running HDFS High Availability (HA). You want to minimize the chance of data loss in your cluster. What should you do? | Run the ResourceManager on a different master from the NameNode in order to load-share HDFS metadata processing. |
You decide to create a cluster which runs HDFS in High Availability mode with automatic failover, using Quorum Storage. What is the purpose of ZooKeeper in such a configuration? | ZooKeeper provides failure detection and ensures that only one NameNode is Active at any given time (the ZKFailoverController on each NameNode host uses ZooKeeper for this election). |
Your cluster implements HDFS High Availability (HA). Your two NameNodes are named nn01 and nn02. What occurs when you execute the command: hdfs haadmin -failover nn01 nn02? | nn01 is fenced, and nn02 becomes the active NameNode. |
You have a Hadoop cluster HDFS, and a gateway machine external to the cluster from which clients submit jobs. What do you need to do in order to run Impala on the cluster and submit jobs from the command line of the gateway machine? | Install the impalad daemon on each machine in the cluster, the statestored daemon and catalogd daemon on one machine in the cluster, and the impala shell on your gateway machine. |
You have just run a MapReduce job to filter user messages to only those of a selected geographical region. The output for this job is in a directory named westUsers, located just below your home directory in HDFS. Which command gathers these into a single file on your local file system? | hadoop fs -getmerge westUsers westUsers.txt |
In CDH4 and later, which file contains a serialized form of all the directory and file inodes in the filesystem, giving the NameNode a persistent checkpoint of the filesystem metadata? | fsimage_N (where N reflects transactions up to transaction ID N) |
You are running a Hadoop cluster with a NameNode on host mynamenode. What are two ways to determine available HDFS space in your cluster? | Run hdfs dfsadmin -report and locate the DFS Remaining value, or connect to the NameNode web UI (http://mynamenode:50070/) and check the DFS Remaining value there. |
You have recently converted your Hadoop cluster from a MapReduce 1 (MRv1) architecture to a MapReduce 2 (MRv2) on YARN architecture. Your developers are accustomed to specifying map and reduce tasks (resource allocation) when they run jobs. A developer wants to know how to specify the number of reduce tasks when a specific job runs. Which method should you tell developers to implement? | Developers specify reduce tasks in the exact same way for both MapReduce version 1 (MRv1) and MapReduce version 2 (MRv2) on YARN. Thus, executing -D mapreduce.job.reduces=2 will specify 2 reduce tasks. |
Your Hadoop cluster contains nodes in three racks. You have not configured the dfs.hosts property in the NameNode’s configuration file. What results? | Any machine running the DataNode daemon can immediately join the cluster |
You are running a Hadoop cluster with MapReduce version 2 (MRv2) on YARN. You consistently see that MapReduce map tasks on your cluster are running slowly because of excessive garbage collection in the JVM. How do you increase the JVM heap size to 3GB to optimize performance? | mapreduce.map.java.opts=-Xmx3072m |
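A sketch of this in mapred-site.xml; the container size (mapreduce.map.memory.mb) shown is an illustrative assumption, chosen somewhat larger than the heap so the container has headroom for non-heap JVM memory:

```xml
<!-- mapred-site.xml: 3 GB max heap for map task JVMs -->
<property>
  <name>mapreduce.map.java.opts</name>
  <value>-Xmx3072m</value>
</property>
<!-- illustrative container size with headroom above the heap -->
<property>
  <name>mapreduce.map.memory.mb</name>
  <value>3584</value>
</property>
```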
Your company stores user profile records in an OLTP database. You want to join these records with web server logs you have already ingested into the Hadoop file system. What is the best way to obtain and ingest these user records? | Ingest with sqoop import. |
Which two are features of Hadoop’s rack topology? | Hadoop gives preference to intra-rack data transfer in order to conserve bandwidth, and rack location is considered in the HDFS block placement policy. |
Furthermore, if you are a student or professional who wants Hadoop training in a world-class environment, you can get in touch with APTRON Solutions. The training institute provides more than 250 IT and non-IT training courses, including hardware, networking, and computer training, as well as software courses in Java, PHP, and .NET, delivered by a team of experienced professionals.