The purpose of this post is to provide instructions on how to get started with the Cloudera Quickstart VM and to cover the main things to know about it. This includes where to find certain configuration files, how to set up a few things that will make your life easier, and more.
Overview
The Cloudera Quickstart VM is a Virtual Machine that comes with a pseudo distributed version of Hadoop preinstalled on it along with the main services that are offered by Cloudera. This includes the Cloudera Manager and Impala as the most notable.
Some Requirements
- Make sure your computer is set up to allow virtualization. This can be enabled in your BIOS on startup.
- To use the Cloudera Manager, you will need to allocate 10 GB of RAM and 2 virtual CPU cores to your VM.
- The Cloudera Manager comes disabled by default. All the Hadoop daemons start on boot and run fine without it, so you don’t strictly need the Cloudera Manager.
Downloads
General Downloads
Latest Quickstart VM
Official Documentation
Importing into VirtualBox
- Download the Quickstart VM from the links above
- Open VirtualBox
- Click on File -> Import Appliance
- Select the Quickstart VM you just downloaded
- Click Continue
- Optional: Double click on the name, and change it to whatever you want.
- Click Import
- Wait for the machine to import. When it is done, it will be listed in the window, ready to start up
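The same import can also be scripted with VirtualBox’s VBoxManage command line tool. Here is a minimal sketch, assuming VBoxManage is on your PATH. The helper name import_quickstart and the file and VM names in the usage line are hypothetical.

```shell
# Import a downloaded appliance into VirtualBox from the command line.
# $1: path to the downloaded .ovf/.ova file
# $2: name to give the imported VM (same effect as renaming it in the import dialog)
import_quickstart() {
  VBoxManage import "$1" --vsys 0 --vmname "$2"
}
```

Usage, with placeholder names: import_quickstart ~/Downloads/quickstart-vm.ovf cloudera-quickstart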
Recommended VirtualBox Configurations
- Right click on the VirtualMachine and click Settings
- Set up the VM to allow copy and paste between the VM and your local machine
- Click on General -> Advanced
- Set Shared Clipboard to Bidirectional
- Set up port forwarding from host port 2222 to guest port 22 to allow SSH to the machine
- Click on Network -> Advanced -> Port Forwarding
- Add a new entry
- Name: 2222
- Host Port: 2222
- Guest Port: 22
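If you prefer the command line, the port forwarding rule above can be sketched with VBoxManage. This assumes the VM is powered off and that its first network adapter is the NAT adapter (the VirtualBox default). The helper name forward_ssh_port is made up for illustration. The clipboard setting is left to the GUI here, since the name of the corresponding flag differs between VirtualBox releases.

```shell
# Add a NAT port-forwarding rule: host port 2222 -> guest port 22.
# $1: name of the VM as listed in VirtualBox (the VM must be powered off).
# Rule format: "<rule name>,<protocol>,<host ip>,<host port>,<guest ip>,<guest port>"
# The rule name "2222" mirrors the Name field used in the GUI steps.
forward_ssh_port() {
  VBoxManage modifyvm "$1" --natpf1 "2222,tcp,,2222,,22"
}
```

Usage, with a placeholder VM name: forward_ssh_port cloudera-quickstart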
SSH’ing to the Machine
Default SSH Credentials: cloudera/cloudera
Host to connect to: localhost
Because of the Recommended VirtualBox Configuration above, we’re forwarding connections from port 2222 to 22. So you would want to use port 2222 to connect.
Linux/Mac
- Open a command line terminal
- Use the ssh command to login
- Enter the password
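With the default credentials and the forwarded port described above, the login can be wrapped in a small helper. This is a sketch and the function name vm_ssh is only illustrative.

```shell
# Open an SSH session to the VM through the forwarded port 2222.
# Any extra arguments are passed to ssh, e.g. a remote command to run.
vm_ssh() {
  ssh -p 2222 cloudera@localhost "$@"
}
```

Run vm_ssh for an interactive login (you will be asked for the password), or vm_ssh hostname to run a single command on the VM.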
Windows
- Open putty
- Set localhost as the Host Name
- Set 2222 as the port
- Connection Type: SSH
- Click open
- Enter the password
Setup password-less SSH (Optional)
- Generate a public and private key locally
- You can follow these instructions:
- Login to the machine with the instructions above
- Create the ~/.ssh directory
- Create the file ~/.ssh/authorized_keys
- Open file
- Add your public key to the authorized_keys file
- Save the authorized_keys file
- Change permissions of .ssh
- Change permissions of the ~/.ssh/authorized_keys
- Change permissions of the home directory: chmod 740 /home/cloudera/
- Now if you try SSH’ing to the machine, you shouldn’t have to provide the password
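The VM-side steps above can be sketched as one helper. The modes 700 for ~/.ssh and 600 for authorized_keys are my assumption of the usual values, since the exact modes are not given above; 740 for the home directory comes from the steps. The function name install_public_key is hypothetical.

```shell
# Authorize a public key for an account and set the permissions sshd expects.
# $1: home directory of the account (e.g. /home/cloudera on the VM)
# $2: the public key line to authorize (contents of your local .pub file)
install_public_key() {
  home_dir="$1"
  pub_key="$2"
  mkdir -p "$home_dir/.ssh"
  touch "$home_dir/.ssh/authorized_keys"
  # Append the key only if that exact line is not already present
  grep -qxF -- "$pub_key" "$home_dir/.ssh/authorized_keys" ||
    printf '%s\n' "$pub_key" >> "$home_dir/.ssh/authorized_keys"
  chmod 700 "$home_dir/.ssh"                  # assumed mode
  chmod 600 "$home_dir/.ssh/authorized_keys"  # assumed mode
  chmod 740 "$home_dir"                       # mode taken from the steps above
}
```

On your local machine, ssh-keygen -t rsa generates the key pair. On the VM you would then call install_public_key /home/cloudera with the contents of your public key file as the second argument.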
Copying Files to the VM
SCP
- Open a command line terminal
- Use the following command:
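A minimal sketch of the copy, assuming the port forwarding and default credentials described earlier. The helper name copy_to_vm is made up for illustration. Note that scp takes the port with an uppercase -P.

```shell
# Copy a local file to the VM over the forwarded SSH port.
# $1: local file to copy
# $2: destination path on the VM (defaults to the cloudera home directory)
copy_to_vm() {
  scp -P 2222 "$1" "cloudera@localhost:${2:-.}"
}
```

For example, copy_to_vm data.csv copies a (placeholder) file data.csv into /home/cloudera on the VM.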
FileZilla or another FTP App
- Open your desired FTP application
- Create a new connection
- Host: localhost
- Username: cloudera
- Password: cloudera
- Port: 2222 (the host port forwarded to port 22 on the VM; the connection runs over SSH, so choose SFTP)
- Connect
Configure Apache Spark to Connect to Hive
If you’re intending to use Apache Spark, you will probably also want to connect to Hive using SparkSQL so you can interact with that relational store. To do this, you need to include the hive-site.xml file in the Spark configuration so Spark knows how to interact with Hive. If you don’t, the app will still run, but you won’t be able to view the tables you have in Hive and you won’t be able to store data in tables.
- SSH into the Machine
- Login as root
- Create a symlink to the hive-site.xml file in the Spark conf directory
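The symlink step can be sketched as follows. The default paths /etc/hive/conf/hive-site.xml and /etc/spark/conf are assumptions about where the VM keeps these configurations, so check them on your VM before running this as root. The helper name link_hive_site is hypothetical.

```shell
# Symlink hive-site.xml into the Spark configuration directory.
# $1: path to hive-site.xml   (assumed default: /etc/hive/conf/hive-site.xml)
# $2: Spark conf directory    (assumed default: /etc/spark/conf)
link_hive_site() {
  hive_site="${1:-/etc/hive/conf/hive-site.xml}"
  spark_conf_dir="${2:-/etc/spark/conf}"
  ln -s "$hive_site" "$spark_conf_dir/hive-site.xml"
}
```

Calling link_hive_site with no arguments uses the assumed defaults. Pass both paths explicitly if your VM uses different locations.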
Configure Apache Spark History Server to allow you to view previously run Spark jobs
If you’re intending to use Apache Spark, you may end up trying to view past runs via the Apache Spark History Server. Out of the box, the Quickstart VM does not let you view past runs because of a permissions issue with the applicationHistory directory in HDFS (/user/spark/applicationHistory): the spark user is not able to read the contents of the directory. You can follow these steps to fix this:
- SSH into the Machine
- Login as hdfs user
- Run “$ sudo su” to login as root, then “$ su hdfs”
- Change the permissions of the applicationHistory directory under the spark home directory in HDFS
- Now when you visit the Apache Spark History Server, you will see any past jobs that have run
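A sketch of the permissions change, to be run as the hdfs user. The steps above do not say which mode to set, so the default of 777 here is purely my assumption and is only reasonable on a throwaway sandbox VM. The helper name open_spark_history_dir is hypothetical.

```shell
# Recursively change the mode of the Spark application history directory in HDFS.
# $1: mode to apply (assumed default: 777, suitable only for a sandbox VM)
open_spark_history_dir() {
  hdfs dfs -chmod -R "${1:-777}" /user/spark/applicationHistory
}
```

You can pass a tighter mode, such as 775, if that is enough for the spark user on your setup.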
Using Beeline to connect to Hive
Beeline is a new command line shell that is supported by HiveServer2. It is recommended to use this over the normal hive shell since it supports better security and functionality.
Credentials
cloudera/cloudera
Starting Shell with beeline Command
This will start the beeline shell.
Note: If you were to run a command such as “show tables” to list the hive tables in the currently selected database at this time you will get the following error:
No current connection
This is because you haven’t technically connected to the HiveServer2 to be able to run hive commands.
To connect you can run the following command. This will prompt you for credentials.
To avoid having to enter credentials each time, you can include the username and password in the connect statement like so:
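As an illustration, here is the connect flow driven from a shell script instead of typed interactively. It assumes Beeline reads commands from standard input the same way it does at its prompt, and it uses the default host, port, and credentials listed in this post. The helper name beeline_show_tables is made up.

```shell
# Start beeline, connect to HiveServer2 with credentials in the !connect
# statement, list the tables in the default database, then exit the shell.
beeline_show_tables() {
  beeline <<'EOF'
!connect jdbc:hive2://localhost:10000/default cloudera cloudera
show tables;
!quit
EOF
}
```

Dropping the username and password from the !connect line makes Beeline prompt for them instead.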
Starting Shell with beeline Command and arguments
Instead of having to use the connect command upon starting the beeline shell, you can automatically connect to the HiveServer2 using command line arguments.
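A small wrapper around that invocation, using the JDBC URL and default credentials from this post. The function name beeline_connect is only illustrative.

```shell
# Start beeline already connected to HiveServer2 on the VM.
# -u: JDBC URL, -n: username, -p: password.
# Extra arguments are passed through, e.g. -e "<query>" to run one statement.
beeline_connect() {
  beeline -u "jdbc:hive2://localhost:10000/default" -n cloudera -p cloudera "$@"
}
```

Run beeline_connect for an interactive shell, or beeline_connect -e "show tables;" to run a single statement and exit.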
Shutting down the Shell
Cloudera Manager
URL: http://quickstart.cloudera:7180/cmf/home
Credentials: cloudera/cloudera
Hue
URL: http://quickstart.cloudera:8888/accounts/login/
Credentials: cloudera/cloudera
Resource Manager
URL: http://quickstart.cloudera:8088/cluster
Credentials: None
Job History
URL: http://quickstart.cloudera:19888/jobhistory
Credentials: None
HBase Master UI
URL: http://quickstart.cloudera:60010/master-status
Credentials: None
Oozie UI
URL: http://quickstart.cloudera:11000/oozie/
Credentials: None
Apache Solr
URL: http://quickstart.cloudera:8983/solr/#/
Credentials: None
Apache Spark History
URL: http://quickstart.cloudera:18088/
Credentials: None
MySQL
Host: localhost
Credentials: root/cloudera
Example Connection
$ mysql -u root -p
Enter password: cloudera
Beeline
Host: localhost
Port: 10000
Credentials: cloudera/cloudera
Example Connection
$ beeline -u jdbc:hive2://localhost:10000/default -n cloudera -p cloudera
Configuration Files:
Databricks Community Edition is an excellent environment for practicing PySpark assignments. However, if you are not satisfied with its speed or the default cluster, or you need to practice Hadoop commands, you can set up your own PySpark Jupyter Notebook environment within the Cloudera QuickStart VM as outlined below. I use a Mac for my work, but Windows is an equally viable option.
Cloudera in VirtualBox
- Download Oracle VirtualBox and follow the installation instructions for your platform. This will be the container in which Cloudera QuickStart VM can run.
- Download the Cloudera QuickStart VM and follow the installation instructions for your platform.
- Import Cloudera .ovf file into VirtualBox.
Anaconda in Cloudera Quickstart
- Bring up Oracle VM VirtualBox Manager application.
- Select the Cloudera Quickstart VM and click on the Start button.
- Within the Cloudera Quickstart VM, use a browser to download the 64-bit Anaconda installer for Python 2.x into ~/Downloads.
- Open a terminal window, and install Anaconda using:
- Accept all prompts.
- Make a copy of the .bashrc file.
- Using an editor, add this export command to .bashrc so that the pyspark command launches Jupyter notebook.
- Source .bashrc or close/open Terminal.
- In the Terminal, to start using Jupyter notebook with Spark, type:
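The .bashrc steps above can be sketched as a single helper. It relies on Spark’s PYSPARK_DRIVER_PYTHON and PYSPARK_DRIVER_PYTHON_OPTS environment variables. The exact export line used originally is not shown, so treat these two lines as one plausible way to do it. The helper name enable_pyspark_notebook is hypothetical.

```shell
# Back up .bashrc, then append the exports that make the pyspark command
# start its driver inside a Jupyter notebook.
# $1: path to the bashrc file (defaults to ~/.bashrc)
enable_pyspark_notebook() {
  bashrc="${1:-$HOME/.bashrc}"
  # Keep a copy of the original file, if there is one
  [ -f "$bashrc" ] && cp "$bashrc" "$bashrc.bak"
  cat >> "$bashrc" <<'EOF'
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
EOF
}
```

The Anaconda installer itself is run with bash against the file you downloaded into ~/Downloads. After calling enable_pyspark_notebook and sourcing .bashrc, typing pyspark in the Terminal should open Jupyter, assuming Anaconda’s bin directory is on your PATH.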