Q1. What exactly is Hadoop?
A1. Hadoop is an open-source Big Data framework that stores and processes huge amounts of data of different types in parallel across a cluster of machines to achieve performance benefits.
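Q2. What are the 5 Vs of Big Data?
A2. Volume – Size of the data
Velocity – Speed at which data arrives and changes
Variety – Different types of data: structured, semi-structured and unstructured
Veracity – Trustworthiness and quality of the data
Value – Usefulness of the data to the business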
Q3. Give me examples of Unstructured data.
A3. Images, videos, audio files, etc.
Q4. Tell me about the Hadoop file system and processing framework.
A4. The Hadoop file system is called HDFS (Hadoop Distributed File System). It consists of the NameNode, the DataNodes and the Secondary NameNode.
The Hadoop processing framework is known as MapReduce. It runs Map and Reduce tasks that are scheduled in parallel for efficiency.
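The Map and Reduce steps can be illustrated with a small word count. This is only an in-memory sketch of the programming model in plain Python, not Hadoop code; the function names are made up for illustration, and a real job would run many map and reduce tasks in parallel on different nodes.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group emitted values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: sum the counts collected for one word."""
    return key, sum(values)

def word_count(lines):
    """Run map, shuffle and reduce sequentially over a list of lines."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
```

Calling word_count(["big data", "Big cluster"]) counts "big" twice and the other words once each.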
Q5. What is the High Availability feature in Hadoop 2?
A5. Hadoop 2 introduces a standby (passive) NameNode so that the NameNode is no longer a single point of failure. This gives the Hadoop cluster High Availability.
Q6. What is Federation?
A6. Federation, introduced in Hadoop 2, allows multiple NameNodes in one Hadoop cluster, each managing part of the namespace. This makes the NameNode layer horizontally scalable and lets the cluster hold a much larger amount of metadata.
Q7. What is MetaData?
A7. MetaData is data about data. In a Hadoop cluster the NameNode maintains the metadata, that is, the information about the files stored in HDFS.
Q8. What are the main components in the Hadoop Eco-System and what are their functions?
A8. Here is a list of Hadoop Eco-System components –
1. HDFS – distributed File System
2. MapReduce – programming paradigm – based on Java
3. Pig – to process and analyse structured and semi-structured data
4. Hive – to process and analyse structured data
5. HBase – NoSQL database
6. Sqoop – Import/Export structured data
7. Oozie – Scheduler
Q9. Tell me some major benefits of Hadoop?
A9. Some major benefits of Hadoop are –
b. Ability to handle multiple data types
c. Ability to handle big data
d. Common platform for machine learning, business intelligence, data warehousing, etc.
Q10. How is Hadoop cost-effective?
A10. Hadoop runs on commodity hardware and is open-source, so it is cost-effective in terms of both hardware and software.
Q11. What is the block size in Hadoop?
A11. The default block size is 64 MB in Hadoop 1 and 128 MB in Hadoop 2.
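To see what the block size means in practice, here is a rough sketch of how many HDFS blocks a file occupies, assuming the default block sizes of 64 MB (Hadoop 1) and 128 MB (Hadoop 2). The helper name is hypothetical.

```python
import math

MB = 1024 * 1024

def hdfs_block_count(file_size_bytes, block_size_bytes=128 * MB):
    """Number of blocks a file is split into; the last block may be only partly filled."""
    return math.ceil(file_size_bytes / block_size_bytes)

# A 1 GB file needs 8 blocks of 128 MB, or 16 blocks of 64 MB.
blocks_hadoop2 = hdfs_block_count(1024 * MB)
blocks_hadoop1 = hdfs_block_count(1024 * MB, block_size_bytes=64 * MB)
```

A 300 MB file with 128 MB blocks takes 3 blocks: two full ones and a final block holding the remaining 44 MB.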
Q12. Please tell me the NameNode port number.
A12. The NameNode web UI uses port 50070 by default in Hadoop 1 and 2 (9870 in Hadoop 3).
Q13. What is the default replication factor in HDFS?
A13. Default replication factor is 3.
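The cost of replication is easy to work out. This is a back-of-the-envelope sketch with hypothetical function names; it ignores non-HDFS overhead such as temporary and operating system space.

```python
def raw_storage_needed(logical_bytes, replication_factor=3):
    """Raw disk space consumed when every block is stored replication_factor times."""
    return logical_bytes * replication_factor

def usable_capacity(raw_bytes, replication_factor=3):
    """Logical data a cluster can hold for a given amount of raw disk space."""
    return raw_bytes // replication_factor

TB = 1024 ** 4
raw_for_10_tb = raw_storage_needed(10 * TB)   # 30 TB of raw disk
usable_from_90_tb = usable_capacity(90 * TB)  # 30 TB of logical data
```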
Q14. What is the command to change the replication factor?
A14. The replication factor of existing files can be changed with the setrep command, for example: hdfs dfs -setrep -w 2 /path/to/file
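If the replication factor has to be changed from a script, one option is to build the hdfs dfs -setrep command line and hand it to subprocess. The sketch below only builds the argument list; the function name and the example path are assumptions, and it expects the hdfs client to be on the PATH when the command is actually run.

```python
def setrep_command(path, replication, wait=False):
    """Build the argument list for: hdfs dfs -setrep [-w] <numReplicas> <path>."""
    cmd = ["hdfs", "dfs", "-setrep"]
    if wait:
        # -w makes the command wait until replication has completed.
        cmd.append("-w")
    cmd += [str(replication), path]
    return cmd

# Example: set replication to 2 for a hypothetical file and wait for completion.
cmd = setrep_command("/data/example.csv", 2, wait=True)
# To execute on a machine with a Hadoop client:
#   import subprocess; subprocess.run(cmd, check=True)
```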
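Q15. Tell me the two most commonly used commands in HDFS.
A15. The put command (copy a file from the local file system into HDFS) and the get command (copy a file from HDFS to the local file system).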
Q16. What are the common types of NoSQL databases?
A16. These are –
a. Columnar database.
b. Document database.
c. Graph database.
Q17. Give me an example of a document database.
A17. MongoDB.
Q18. Give me examples of columnar databases.
A18. Cassandra and HBase.
Q19. Tell me about the execution modes of Apache Pig.
A19. Pig can be executed in local and MapReduce modes.
Q20. How would you import data from MySQL into HDFS?
A20. Using Sqoop.
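A typical Sqoop import names the JDBC URL, the user, the table and the HDFS target directory. The sketch below builds such a command line; the function name, host, database, table and paths are all made-up placeholders, and running it requires a Sqoop 1 client and a MySQL JDBC driver.

```python
def sqoop_import_command(jdbc_url, username, table, target_dir, num_mappers=1):
    """Build the argument list for a basic 'sqoop import' of one table into HDFS."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--username", username,
        "-P",                      # prompt for the password instead of passing it inline
        "--table", table,
        "--target-dir", target_dir,
        "--num-mappers", str(num_mappers),
    ]

# Hypothetical example values.
cmd = sqoop_import_command(
    "jdbc:mysql://dbhost/sales", "etl_user", "orders", "/user/etl/orders", num_mappers=4
)
```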
Q21. Which Hadoop features are extended to its eco-system components?
A21. High Availability, Horizontal Scalability and Replication/Data Redundancy.
Q22. What is an input split in Hadoop?
A22. An input split is a logical chunk of the input data, as opposed to the physical HDFS block. Each input split is processed by one map task, so the number of mappers equals the number of input splits.
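As a rough illustration, the number of map tasks for a job can be estimated from the file sizes and the split size. This is a simplified sketch with a hypothetical function name: it assumes the split size equals a 128 MB block size, that splits never span files, and it ignores details such as the small tolerance Hadoop allows on the last split and input formats that combine small files.

```python
import math

MB = 1024 * 1024

def estimated_map_tasks(file_sizes_bytes, split_size_bytes=128 * MB):
    """Estimate mappers as the sum over files of ceil(file size / split size)."""
    return sum(math.ceil(size / split_size_bytes) for size in file_sizes_bytes)

# A 300 MB file gives 3 splits and a 10 MB file gives 1, so about 4 mappers.
mappers = estimated_map_tasks([300 * MB, 10 * MB])
```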
Q23. How would you compare Cassandra with HBase?
A23. Cassandra is a NoSQL database that favours high availability, while HBase is a NoSQL database that favours strong consistency.
Q24. What is the CAP Theorem?
A24. The CAP theorem deals with Consistency, Availability and Partition tolerance. It states that a distributed data store can guarantee at most two of these three properties at the same time.
Q25. Tell me the properties of RDBMS as per the CAP Theorem.
A25. A traditional RDBMS provides consistency and availability, but it lacks partition tolerance.