Installation

This page is the starting point of your TDP journey. We’ll walk you through how to deploy and leverage a fully featured cluster from a single command. All files and instructions used to provision the virtual machines (VMs) and to deploy TDP are given in the next sections.

The deployment relies on the official tdp-getting-started repository which provides scripts and configurations to launch a TDP environment on top of a cluster of VMs.

Note: This cluster is intended for testing purposes only. Refer to the Documentation for production-ready clusters.

Requirements

Hardware requirements

The Getting Started environment deploys a cluster of 6 VMs. We recommend at least the following hardware specifications:

  • CPU: 8 cores
  • Storage: 3 GB of free space. Performance may suffer if disk usage exceeds 75%. NVMe SSDs are a plus.
  • Memory: 32 GB

Note: If required, update the machine resources assigned to the VMs in the ./inventory/group_vars/all.yml file of your cloned repository.
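
A quick way to compare the host against the recommended CPU and memory figures is sketched below. It is a minimal check, assuming a Linux host where /proc/meminfo is readable. The check_min helper is a hypothetical name introduced for illustration and is not part of tdp-getting-started.

```shell
#!/bin/sh
# Sketch only: compares host resources with the recommended minimums above.
# check_min is a hypothetical helper, not part of tdp-getting-started.

# Print a status line for one resource; fail when it is below the minimum.
check_min() {
    label=$1
    actual=$2
    required=$3
    if [ "$actual" -ge "$required" ]; then
        printf '%s: %s (ok, recommended >= %s)\n' "$label" "$actual" "$required"
    else
        printf '%s: %s (below recommended %s)\n' "$label" "$actual" "$required"
        return 1
    fi
}

warn=0

# Number of online CPU cores.
cores=$(getconf _NPROCESSORS_ONLN 2>/dev/null)
if [ -n "$cores" ]; then
    check_min "CPU cores" "$cores" 8 || warn=1
fi

# Total memory in GB, rounded up because the kernel reports slightly less
# than the installed amount. Assumes Linux (/proc/meminfo).
if [ -r /proc/meminfo ]; then
    mem_gb=$(awk '/^MemTotal:/ { print int(($2 + 1048575) / 1048576) }' /proc/meminfo)
    check_min "Memory (GB)" "$mem_gb" 32 || warn=1
fi

if [ "$warn" -eq 1 ]; then
    echo "Consider lowering the VM resources in ./inventory/group_vars/all.yml"
fi
```

The figures are recommendations, so the script only reports shortfalls and never aborts.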

Note: This guide provides instructions for Ubuntu systems. It has also been tested with NixOS and macOS. Only x86_64 systems have been tested so far. Other Linux distributions, as well as Windows, should work.

Software requirements

The following needs to be installed to launch the cluster:

  • Git
    Git is a widely used version control system. All TDP repositories are versioned with Git, and the setup.sh script relies on it.

    # Update the package list and install git
    sudo apt update && sudo apt install git
    
  • Python3 release 3.6 or newer
    Python3 is a programming language widely used to manage data. Most UNIX systems come with Python pre-installed. Check whether it is available using the following command:

    # Display the version of Python
    python3 -V
    

    If Python3 is not installed, install it using the following command:

    # Update the package list and install python3
    sudo apt update && sudo apt install python3
    
  • venv
    venv is a Python3 package for creating lightweight virtual environments. TDP creates a virtual environment to manage all the Python requirements: Ansible, JMESPath and Mitogen.

    • Ansible: Configuration management tool. TDP is deployed using Ansible Playbooks.
    • Mitogen: Python library for writing distributed self-replicating programs. Used as a connection layer and module runtime for Ansible.
    • JMESPath: Python package used to query JSON.

    Note: Do not install these Python requirements yourself outside the virtual environment.

    # Update the package list and install venv
    sudo apt update && sudo apt install python3-venv
    
  • VirtualBox - release 6.1.26 or newer
    VirtualBox is a powerful virtualization software.

    # Update the package list and install VirtualBox
    sudo apt update && sudo apt install virtualbox
    
  • Vagrant - release 2.2.19 or newer
    Vagrant is used to provision machines on top of VirtualBox.

    # Install the HashiCorp trusted key
    wget -O- https://apt.releases.hashicorp.com/gpg | gpg --dearmor | sudo tee /usr/share/keyrings/hashicorp-archive-keyring.gpg
    # Add the HashiCorp package source
    echo "deb [signed-by=/usr/share/keyrings/hashicorp-archive-keyring.gpg] https://apt.releases.hashicorp.com $(lsb_release -cs) main" | sudo tee /etc/apt/sources.list.d/hashicorp.list
    # Update the package list and install Vagrant
    sudo apt update && sudo apt install vagrant
    
  • Zip and Unzip
    zip and unzip are popular packages to manage compressed files. They are used during deployment.

    # Update the package list and install zip and unzip
    sudo apt update && sudo apt install zip unzip
    
  • jq
    jq is required to execute helper scripts during deployment.

    # Update the package list and install jq
    sudo apt update && sudo apt install jq
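
The requirements listed above can be verified in one pass before starting the deployment. The following is a minimal sketch, not part of tdp-getting-started: missing_tools and version_ge are hypothetical helper names, and the version comparison only handles purely numeric dotted versions, so tools that append a suffix to their version string need extra parsing.

```shell
#!/bin/sh
# Sketch only: missing_tools and version_ge are hypothetical helpers,
# not part of tdp-getting-started.

# Print every command name given as argument that is not found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Succeed when dotted version $1 is greater than or equal to $2.
# Handles purely numeric fields only (e.g. 2.2.19), not suffixes like "rc1".
version_ge() {
    v1=$1
    v2=$2
    while [ -n "$v1" ] || [ -n "$v2" ]; do
        a=${v1%%.*}
        b=${v2%%.*}
        # Drop the field just read; clear the string once no dot remains.
        case $v1 in *.*) v1=${v1#*.} ;; *) v1= ;; esac
        case $v2 in *.*) v2=${v2#*.} ;; *) v2= ;; esac
        a=${a:-0}
        b=${b:-0}
        [ "$a" -gt "$b" ] && return 0
        [ "$a" -lt "$b" ] && return 1
    done
    return 0
}

# VBoxManage is the command-line tool installed with VirtualBox.
missing=$(missing_tools git python3 VBoxManage vagrant zip unzip jq)
if [ -n "$missing" ]; then
    echo "Missing tools:" $missing
fi

# "python3 -V" prints e.g. "Python 3.10.12"; keep the second word.
if command -v python3 >/dev/null 2>&1; then
    py_version=$(python3 -V 2>&1 | cut -d' ' -f2)
    version_ge "$py_version" 3.6 || echo "Python $py_version is older than 3.6"
fi
```

The same version_ge call can be reused for the Vagrant and VirtualBox minimums (2.2.19 and 6.1.26) once their version strings have been reduced to the numeric part.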
    

Deployment

The steps below will deploy a TDP cluster with Vagrant using the parameters in the inventory directory. The Ansible host.ini file will be generated using the hosts variable in tdp-vagrant/vagrant.yml.

Prerequisites

Clone the tdp-getting-started project

git clone https://github.com/TOSIT-IO/tdp-getting-started.git
cd tdp-getting-started

The tdp-getting-started repository provides a setup.sh script to set up the environment:

./scripts/setup.sh -e extras -e prerequisites -e vagrant

Note: Use the release option (-r stable | latest) to choose which version of the collections to use.

Activate Python virtual environment

source ./venv/bin/activate

Enable Mitogen

export ANSIBLE_STRATEGY_PLUGINS="$(python -c 'import os,ansible_mitogen; print(os.path.dirname(ansible_mitogen.__file__))')/plugins/strategy"
export ANSIBLE_STRATEGY="mitogen_linear"
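
Before launching a long deployment, it can be worth confirming that the exported path really points at Mitogen's strategy plugins. The sketch below assumes that the Mitogen package ships a mitogen_linear.py file in its plugins/strategy directory. strategy_dir_ok is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch only: strategy_dir_ok is a hypothetical helper.
# Assumes Mitogen provides plugins/strategy/mitogen_linear.py.

# Succeed when $1 is a directory containing the mitogen_linear strategy plugin.
strategy_dir_ok() {
    [ -n "$1" ] && [ -f "$1/mitogen_linear.py" ]
}

if strategy_dir_ok "${ANSIBLE_STRATEGY_PLUGINS:-}"; then
    echo "Mitogen strategy plugins found in $ANSIBLE_STRATEGY_PLUGINS"
else
    echo "Mitogen strategy plugins not found; is the virtual environment active?"
fi
```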

Launch the virtual machines using Vagrant

vagrant up
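
Once vagrant up returns, the state of each VM can be checked before moving on. The sketch below assumes the comma-separated timestamp,target,type,data layout of Vagrant's --machine-readable output and must be run from the tdp-getting-started directory. not_running is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch only: not_running is a hypothetical helper.
# Assumes lines of the form "timestamp,target,type,data" on stdin,
# as printed by "vagrant status --machine-readable".

# Print the name of every machine whose state is not "running".
not_running() {
    awk -F, '$3 == "state" && $4 != "running" { print $2 }'
}

# Only query Vagrant when it is installed and the status command succeeds.
if command -v vagrant >/dev/null 2>&1 && status=$(vagrant status --machine-readable); then
    stopped=$(printf '%s\n' "$status" | not_running)
    if [ -n "$stopped" ]; then
        echo "Machines not running:" $stopped
    else
        echo "All machines are running"
    fi
fi
```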

Configure the TDP prerequisites using ansible_collections/tosit/tdp_prerequisites/playbooks/all.yml. This playbook deploys the following services: Chrony, CA, LDAP, KDC and PostgreSQL.

ansible-playbook ansible_collections/tosit/tdp_prerequisites/playbooks/all.yml

Deploy

There are two ways to deploy a TDP cluster: using the TDP lib CLI or using Ansible playbooks.

Note: The deployment takes about 45 minutes with a decent internet connection.

Deploy with TDP lib CLI

Deploy the TDP cluster core and extra services:

tdp deploy

Deploy with Ansible playbook

The ansible_collections/tosit/tdp/playbooks/meta/all.yml playbook deploys the following services: Exporter, ZooKeeper, Hadoop core (HDFS, YARN, MapReduce), Ranger, Hive, Spark (2 and 3), HBase and Knox.

ansible-playbook ansible_collections/tosit/tdp/playbooks/meta/all.yml

To deploy the extra services, use the following commands:

ansible-playbook ansible_collections/tosit/tdp_extra/playbooks/meta/livy.yml
ansible-playbook ansible_collections/tosit/tdp_extra/playbooks/meta/livy-spark3.yml
ansible-playbook ansible_collections/tosit/tdp_extra/playbooks/meta/zookeeper-kafka.yml
ansible-playbook ansible_collections/tosit/tdp_extra/playbooks/meta/kafka.yml
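
When the commands above are pasted together, a failure in one playbook does not prevent the next one from starting. A minimal wrapper that stops at the first failure could look as follows. run_playbooks and the PLAYBOOK_CMD variable are hypothetical names introduced for this sketch, and the example call is a dry run that only prints the playbook paths.

```shell
#!/bin/sh
# Sketch only: run_playbooks and PLAYBOOK_CMD are hypothetical names,
# not part of tdp-getting-started.

# Run each playbook in order and stop at the first failure.
# PLAYBOOK_CMD overrides the command used (default: ansible-playbook).
run_playbooks() {
    for playbook in "$@"; do
        printf 'Running %s\n' "$playbook"
        "${PLAYBOOK_CMD:-ansible-playbook}" "$playbook" || {
            printf 'Failed: %s\n' "$playbook" >&2
            return 1
        }
    done
}

# Dry run: echo prints each path instead of executing the playbook.
PLAYBOOK_CMD=echo
run_playbooks \
    ansible_collections/tosit/tdp_extra/playbooks/meta/livy.yml \
    ansible_collections/tosit/tdp_extra/playbooks/meta/livy-spark3.yml \
    ansible_collections/tosit/tdp_extra/playbooks/meta/zookeeper-kafka.yml \
    ansible_collections/tosit/tdp_extra/playbooks/meta/kafka.yml
```

Unset PLAYBOOK_CMD to run the playbooks for real with ansible-playbook.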

Configure HDFS and Ranger

Once the deployment is done, HDFS user home directories and Ranger policies are configured with dedicated playbooks:

Configure HDFS user home directories:

ansible-playbook ansible_collections/tosit/tdp/playbooks/utils/hdfs_user_homes.yml

Configure Ranger policies:

ansible-playbook ansible_collections/tosit/tdp/playbooks/utils/ranger_policies.yml

Next steps

Connect to the cluster

The nodes are accessible via SSH with the vagrant ssh {fqdn} command. For example:

vagrant ssh edge-01.tdp

Access the UI

All service UIs are kerberized for security purposes. Your host must be configured to enable SPNEGO with your web browser.

We also recommend you configure your /etc/hosts file to access the UIs more easily.

Refer to our host configuration page for detailed instructions.

Learn to use TDP

Refer to our docs to learn how to manage and administer your cluster.

Now that you have a cluster up and running, you can follow our tutorials to learn how to use its components: