Loom - Teradata Developer Exchange

Loom Installation
Valid for: Loom 2.0.0+
Contents

Conceptual Overview of Installation
Prerequisites
First-time Installation
    Download and Install Loom
    Start Loom
Upgrade
    Back Up Current Registry
    Download and Install Loom
    Stop and Start Loom
    Restore Registry
Advanced Configuration
    Security, User Impersonation, and Authentication
    ActiveScan: Potential Sources
    Custom Metadata Properties
Conceptual Overview of Installation
● Download the Loom distribution
● Edit configuration files
● Start the Loom server

In these instructions, edit placeholder values – version numbers such as x.y.z and values in angle brackets such as <port#> – before executing commands.
Prerequisites
Consult your system administrator as needed for the following prerequisites.
1. A Hadoop cluster running on Linux machines.
   a. Loom 2.0+ has been tested on the following Hadoop distributions. Loom supports MRv2 (YARN) as well as MRv1.

      Distributor     Version
      Cloudera        CDH 5.1
      Hortonworks     HDP 2.1
      Teradata        TDH 2.1

   b. Operating Systems: Linux. Loom has been run on Ubuntu, CentOS, RHEL, and SLES.
   c. Browsers: Chrome and Firefox.
2. Choosing an installation location for Loom
   a. On the cluster
      i. It is recommended that you install Loom on the NameNode, for simplicity in managing permissions. However, Loom can be run on any node in the cluster.
   b. Off the cluster
      i. Loom can also be run outside the cluster, on a machine that can communicate with the Hadoop APIs but is not itself running any Hadoop services (commonly known as an “edge” node).
      ii. It is not necessary for users on the machine to be able to access HDFS from the command line, but this machine will need to have a copy of the same Hadoop distribution files as the cluster – in particular, the libraries for Hadoop, Hive, and HCatalog.
3. Local Username/Permissions
   a. On both the machine where you will be running Loom and on all nodes in the cluster, create a dedicated Linux username for Loom. The alphanumeric ID, numeric user ID (UID), and group ID (GID) for the user must be the same across machines (a sketch of the commands appears below).
      i. This user will be referred to as loomuser throughout this document, but it can have any name.
      ii. Depending on Loom security settings (see Advanced Configuration > Security), this will be the username interacting directly with Hadoop services.
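For example, a minimal sketch of creating such a user (the UID/GID value 1200 is an arbitrary illustration; pick any unused, consistent value and repeat the commands on every machine):

sudo groupadd -g 1200 loomuser
sudo useradd -u 1200 -g 1200 -m loomuser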
   b. Grant loomuser sudo privileges.
      i. This is not absolutely necessary, but if you choose not to do so, you will need access to another username with sudo privileges in order to change ownership of the directory where Loom is downloaded.
   c. On the machine where Loom will be running, grant loomuser ownership of the following local directory:

      file:/tmp/loomuser
      The default location for local temporary files created by Hive when executed by the loomuser userid. This may be overridden in the “hive.exec.local.scratchdir” property of hive-site.xml.
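For example, assuming the default path, the directory can be created and handed to loomuser with:

sudo mkdir -p /tmp/loomuser
sudo chown loomuser:loomuser /tmp/loomuser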
   d. Set HIVE_HOME, HADOOP_HOME, and HCAT_HOME environment variables for loomuser. These variables should be set permanently for loomuser, or specified in loom-server.sh; they should not just be set for the current shell session. (A sketch of setting them permanently appears after the examples below.)
      i. These variables should be set to the directories that contain the Hive, Hadoop, and HCatalog “lib” directories, respectively, and should NOT have a trailing slash.
         1. The exact values will vary depending on your Hadoop distribution. Examples are below, but you should confirm that the Hive and Hadoop “lib” files are actually located at these paths.
            a. Typical example for Hortonworks

HIVE_HOME=/usr/lib/hive
HCAT_HOME=/usr/lib/hive-hcatalog
HADOOP_HOME=/usr/lib/hadoop

            b. Typical example for CDH4 as installed by Cloudera Manager

HIVE_HOME=/opt/cloudera/parcels/CDH-4.2.0-1.cdh4.2.0.p0.10/lib/hive
HCAT_HOME=/opt/cloudera/parcels/CDH-4.2.0-1.cdh4.2.0.p0.10/lib/hcatalog
HADOOP_HOME=/opt/cloudera/parcels/CDH-4.2.0-1.cdh4.2.0.p0.10/lib/hadoop

            c. For TDH, the following additional environment variable is needed:

PATH=$PATH:/opt/teradata/jvm64/jdk7/bin
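One way to set the variables permanently is to append them to loomuser’s shell profile. A sketch, assuming a bash login shell and the Hortonworks paths shown above:

cat >> /home/loomuser/.bashrc <<'EOF'
export HIVE_HOME=/usr/lib/hive
export HCAT_HOME=/usr/lib/hive-hcatalog
export HADOOP_HOME=/usr/lib/hadoop
EOF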
4. Hadoop Username/Permissions
   a. Grant loomuser read and write access to the following HDFS directory:

      hdfs:/user/hive/warehouse
      The default location of the Hive warehouse. This may be overridden in the “hive.metastore.warehouse.dir” property of the hive-site.xml file.

   b. Create and grant loomuser ownership of the following HDFS directories:

      hdfs:/tmp/hive-loomuser
      The default location for temporary files created by Hive when executed by the loomuser userid. This may be overridden in the “hive.exec.scratchdir” property of hive-site.xml.

      hdfs:/user/loomuser
      The home directory for loomuser on HDFS.

   c. Grant loomuser read and write access to any HDFS directories where the user will want to browse, query, or output new data. (A sketch of the corresponding commands appears below.)
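For example, run as an HDFS superuser (typically the ‘hdfs’ user; exact commands vary by distribution and security setup, and older releases use ‘hadoop fs’ in place of ‘hdfs dfs’):

sudo -u hdfs hdfs dfs -mkdir -p /tmp/hive-loomuser /user/loomuser
sudo -u hdfs hdfs dfs -chown -R loomuser /tmp/hive-loomuser /user/loomuser
# 775 grants read/write to the warehouse directory's group; ensure loomuser
# is a member of that group, or use HDFS ACLs instead.
sudo -u hdfs hdfs dfs -chmod -R 775 /user/hive/warehouse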
5. Hive
   a. Install Hive with a multi-user metastore, such as MySQL or PostgreSQL.
      i. If Hive was installed as a demo, it is probably using the default Apache Derby metastore, which is single-user. Your Hadoop distributor should have instructions on switching Hive to use a non-Derby metastore.
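A quick way to check which metastore Hive is currently using (a sketch; the location of hive-site.xml varies by distribution, /etc/hive/conf being common):

grep -A1 'javax.jdo.option.ConnectionURL' /etc/hive/conf/hive-site.xml
# A jdbc:derby:... value means Hive is still on the single-user Derby metastore.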
6. Networking
a. Ports: The port on which Loom will run (8080 by default, but you can specify any port at
runtime) must be exposed such that intended users of Loom will be able to access that
port through their web browser.
7. Web Browser
a. The latest versions of Firefox and Chrome are compatible with Loom. Internet Explorer
is not supported.
First-time Installation
That is, on a cluster where Loom has never been installed:
Download and Install Loom
1. Open an SSH session on the machine where you are going to install Loom.
2. Create a loom directory wherever you want Loom installed (e.g. /usr/local), transfer ownership
to loomuser, and cd into it.
loomuser@node:~$ cd /usr/local
loomuser@node:/usr/local$ mkdir loom
loomuser@node:/usr/local$ sudo chown -R loomuser /usr/local/loom
loomuser@node:/usr/local$ cd loom
3. Download Loom x.y.z (for example, 1.2.7) and unzip:

loomuser@node:/usr/local/loom$ wget --no-check-certificate http://www.revelytix.com/transfer/loom-x.y.z-distribution.zip; unzip loom-x.y.z-distribution.zip
4. Run the bin/check-setup.sh script.
a. For MapR users: you will need to uncomment (i.e. delete the pound sign before) the
following line in bin/check-setup.sh, in order to include certain native dependencies.
loom-x.y.z-distribution/bin/check-setup.sh
# MapR requires native dependencies
JAVA_LIB_PATH="-Djava.library.path=$HADOOP_HOME/lib/native/Linux-amd64-64"
b. You only need to run this script once: before the first time you start Loom.
loomuser@node:/usr/local/loom$ loom-x.y.z-distribution/bin/check-setup.sh
# Example output
loomuser@node:/usr/local/loom$ loom-x.y.z-distribution/bin/check-setup.sh
Checking Loom configuration
Checking default loom port ........ port '8080' on host 'localhost' ... OK.
Checking availability of datomic transactor port ........ port '4334' on host 'localhost' ... OK.
Checking default Hadoop FileSystem ........ configured to use hdfs://localhost:8020 ... OK.
Checking default Hadoop JobTracker ........ configured to use JobTracker 'localhost' port '50030' ... OK.
Loom is ready to run.
c. If the “default loom port” check fails:
   i. The default port for Loom Server is 8080, but Loom can easily be run on a different port. Instructions are included in the documentation below, starting with the phrase “To run this server on a different port...”
d. If the “availability of datomic transactor” check fails:
   i. This means another application is running on port 4334, 4335, or 4336. If you cannot remove the application, it is possible to configure Loom to start the transactor on a different set of three contiguous ports. Open loom-x.y.z-distribution/lib/datomic/transactor.properties, and set ‘port’ to the first port in the sequence you want to use:

loom-x.y.z-distribution/lib/datomic/transactor.properties
########### free mode config ###############
protocol=free
host=localhost
#free mode will use 3 ports starting with this one:
port=<firstport>

   ii. You may also be seeing this error if you have started Loom on this machine before; as mentioned above, it is only necessary to run check-setup.sh before the first time you start Loom. Once you start Loom, the transactor runs as a background process on ports 4334-4336, and will keep running on these ports in between restarts of the Loom server.
e. If the “default Hadoop FileSystem” check fails: either you did not set HADOOP_HOME correctly (see Prerequisites > Local Username/Permissions) or HDFS is not running.
f. If the “default Hadoop JobTracker” check fails: either you did not set HADOOP_HOME correctly (see Prerequisites > Local Username/Permissions) or the JobTracker is not running.
5. Set Loom’s DistributedCache directory
   a. In loom-x.y.z-distribution/config/loom.properties, set loom.dist.cache to the desired HDFS directory. It defaults to hdfs:/user/${user.name}/loom-dist-cache, where ${user.name} is the name of the user who starts the Loom server.

# Sets the location in HDFS where Loom manages the distributed cache that it
# uses to configure MapReduce jobs that it submits. The Loom server process
# must have permission to write in this location.
loom.dist.cache=<distributedcachepath>

   b. IMPORTANT: <distributedcachepath> must BOTH be an absolute path (not a URI) for an HDFS folder AND ALSO end with a "/". For example:

/user/loom/                      ACCEPTABLE
/user/loom                       NOT ACCEPTABLE
loom/                            NOT ACCEPTABLE
hdfs://master:9000/user/loom/    NOT ACCEPTABLE
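If the target directory does not yet exist, it can be created ahead of time. A sketch, assuming loom.dist.cache=/user/loomuser/loom/ and an HDFS superuser:

sudo -u hdfs hdfs dfs -mkdir -p /user/loomuser/loom
sudo -u hdfs hdfs dfs -chown loomuser /user/loomuser/loom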
6. At this point, if you want to take advantage of Loom’s advanced configuration options, see the
“Advanced Configuration” section and complete the relevant steps before proceeding to the
next step below.
Start Loom
1. For MapR users: you will need to uncomment (i.e. delete the pound sign before) the following
line in bin/loom-server.sh, in order to include certain native dependencies.
loom-x.y.z-distribution/bin/loom-server.sh
# MapR requires native dependencies
JAVA_LIB_PATH="-Djava.library.path=$HADOOP_HOME/lib/native/Linux-amd64-64"
2. Start the Loom Server.
   a. IMPORTANT: always run the loom-server.sh script from the current distribution directory, e.g. /usr/local/loom/loom-x.y.z-distribution. Loom has certain dependencies that require it to be started from the distribution directory.
   b. These examples use ‘nohup’ plus ‘&’ to run Loom in the background. You can also run Loom from a ‘screen’ window, if you have the ‘screen’ package installed.

loomuser@node:/usr/local/loom/loom-x.y.z-distribution$ nohup ./bin/loom-server.sh &
[hit ENTER to regain command-line access]

   c. To run this server on a different port, include the port number after loom-server.sh when starting the Loom server:

loomuser@node:/usr/local/loom/loom-x.y.z-distribution$ nohup ./bin/loom-server.sh <port#> &
# Example
loomuser@node:/usr/local/loom/loom-x.y.z-distribution$ nohup ./bin/loom-server.sh 8081 &

   d. Check the contents of nohup.out. Once Loom is running, you will see the message “Loom Server started.”
loomuser@node:/usr/local/loom/loom-x.y.z-distribution$ tail -f nohup.out
-h gives a list of usages/options
Starting Database...
HADOOP_CP=<HADOOP_CP>
HIVE_CP=<HIVE_CP>
Starting Loom Server...
Starting Loom Server on port 8080
Loom Server started
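As a quick check from the shell that the server is actually listening (assuming curl is installed and the default port), request the landing page; any HTTP response line, e.g. 200 or a redirect, means the server is up:

loomuser@node:~$ curl -sI http://localhost:8080 | head -1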
Congratulations! You have now installed Loom.
Upgrade
For a cluster where Loom has already been installed:
Back Up Current Registry
1. To make a copy of your existing registry, run the backup.sh script from the distribution directory. By default, <host>=localhost, <port>=8080, and <backupfile>=backup.json.

loomuser@node:/usr/local/loom/loom-x.y.z-distribution$ ./bin/backup.sh -h <host> -p <port> -o <backupfile>
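For example, an illustrative run using the defaults:

loomuser@node:/usr/local/loom/loom-x.y.z-distribution$ ./bin/backup.sh -h localhost -p 8080 -o backup.json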
   a. This will produce a backup.json file in the distribution directory:

loomuser@node:/usr/local/loom/loom-x.y.z-distribution$ ls
backup.json  bin  config  data  datomic.pid  docs  lib  license  logs  plugins  R  README.txt  registry  schema
Download and Install Loom
1. Open an SSH session on the machine where you have installed Loom.
2. Cd into the loom directory.
loomuser@node:~$ cd /usr/local/loom
3. Load Loom a.b.c (for example, 1.0.1) onto the node and unzip:

loomuser@node:/usr/local/loom$ wget --no-check-certificate http://www.revelytix.com/transfer/loom-a.b.c-distribution.zip; unzip loom-a.b.c-distribution.zip
4. Set Loom’s DistributedCache directory.
   a. In loom-a.b.c-distribution/config/loom.properties, set loom.dist.cache to the desired HDFS directory. It defaults to hdfs:/user/${user.name}/loom-dist-cache, where ${user.name} is the name of the user who starts the Loom server.

# Sets the location in HDFS where Loom manages the distributed cache that it
# uses to configure MapReduce jobs that it submits. The Loom server process
# must have permission to write in this location.
loom.dist.cache=<distributedcachepath>

   b. IMPORTANT: <distributedcachepath> must BOTH be an absolute path (not a URI) for an HDFS folder AND ALSO end with a "/". For example:

/user/loom/                      ACCEPTABLE
/user/loom                       NOT ACCEPTABLE
loom/                            NOT ACCEPTABLE
hdfs://master:9000/user/loom/    NOT ACCEPTABLE
5. See the “Advanced Configuration” section in this document for instructions on additional
configuration options.
Stop and Start Loom
1. Find the PID of the currently running Loom server.
a. If you have sudo permissions:
loomuser@node:/usr/local/loom/loom-a.b.c-distribution$ sudo netstat -tnlp |
grep <port#>
# Example
loomuser@node:/usr/local/loom/loom-a.b.c-distribution$ sudo netstat -tnlp |
grep 8080
[sudo] password for loomuser:
tcp6
0
18139/java
0 :::8080
:::*
LISTEN
13
b. If you do not have sudo permissions, you can use an alternative method; the Loom
server process will be the first process returned
loomuser@node:/usr/local/loom$ ps aux | grep revelytix.servlet
# Example
loomuser@node:/usr/local/loom/loom-a.b.c-distribution$ ps aux | grep
revelytix.servlet
loomuser
18139 0.8 13.6 1051396 280792 ?
Sl
10:00
1:20 java XX:PermSize=128m -XX:MaxPermSize=256m -Dtransactor.props=loom-0.6.1distribution/bin/../lib/datomic/transactor.properties -cp loom-0.6.1distribution/bin/../config:loom-0.6.1-distribution/bin/../lib/*:loom-0.6.1distribution/bin/../lib/ext/*:/opt/cloudera/parcels/CDH-4.2.01.cdh4.2.0.p0.10/lib/hadoop/etc/hadoop::/opt/cloudera/parcels/CDH-4.2.01.cdh4.2.0.p0.10/lib/hadoop/client-0.20/* revelytix.servlet
loomuser
19380 0.0 0.0
7624
color=auto revelytix.servlet
932 pts/0
S+
12:30
0:00 grep --
2. Kill the currently running Loom server.
loomuser@node:/usr/local/loom/loom-a.b.c-distribution$ kill <PID>
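If you prefer a one-liner, the PID lookup and kill can be combined (a sketch; the bracketed grep pattern keeps grep from matching itself):

loomuser@node:/usr/local/loom/loom-a.b.c-distribution$ kill $(ps aux | grep '[r]evelytix.servlet' | awk '{print $2}')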
3. If you are upgrading Loom, you must stop the transactor processes. You can skip this step if you
are simply restarting the Loom server, i.e. using the same distribution.
loomuser@node:/usr/local/loom/loom-a.b.c-distribution$ ./bin/stop-database.sh
Stopped Database
4. Start the new Loom server. IMPORTANT: always invoke the loom-server.sh script from the distribution directory, e.g. /usr/local/loom/loom-a.b.c-distribution. Loom has certain dependencies that require it to be started from the distribution directory.

loomuser@node:/usr/local/loom/loom-a.b.c-distribution$ nohup ./bin/loom-server.sh &
[hit ENTER to regain command-line access]

   a. To run this server on a different port, simply specify the port when starting the Loom server:

loomuser@node:/usr/local/loom/loom-a.b.c-distribution$ nohup ./bin/loom-server.sh <port#> &
# Example
loomuser@node:/usr/local/loom/loom-a.b.c-distribution$ nohup ./bin/loom-server.sh 8081 &
5. Do not log into the Lab Bench or attempt to view or register data before finishing the next
section.
Restore Registry
1. From the new distribution directory, restore the registry, using the backup.json file you created
with the previous distribution. By default, <host>=localhost and <port>=8080.
loomuser@node:/usr/local/loom/loom-a.b.c-distribution$ ./bin/restore.sh /usr/local/loom/loom-x.y.z-distribution/<backupfile> -h <host> -p <port>
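For example, an illustrative run using the defaults and the backup.json created earlier:

loomuser@node:/usr/local/loom/loom-a.b.c-distribution$ ./bin/restore.sh /usr/local/loom/loom-x.y.z-distribution/backup.json -h localhost -p 8080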
Congratulations! You have now updated Loom.
Advanced Configuration
Security, User Impersonation, and Authentication
1. See loom-x.y.z-distribution/docs/Loom_Security.txt for details. You will need to restart Loom after making any Loom configuration changes, and restart Hadoop services after making any Hadoop configuration changes.
ActiveScan: Potential Sources
1. One of Loom’s features is the ability to detect “Potential Sources;” that is, regularly and
recursively scan a specified HDFS directory to detect new files, which Loom displays in the
‘Sources’ Home page of the Loom Lab Bench (browser UI), as well as on the ‘Loom’ home page
in the ‘Recent Sources’ column.
2. To turn on ActiveScan: Potential Sources, edit loom-x.y.z-distribution/config/loom.properties:
loom-x.y.z-distribution/config/loom.properties
# Enable active scanning of potential datasets in HDFS.
activeScan.dataset.enabled=true
# Set the top-level directory under which to scan for potential datasets
# in HDFS. May be specified as an absolute hdfs:// URL or a relative
# path that will be resolved against the Loom working directory.
# Defaults to the Loom working directory.
activeScan.dataset.baseDir=<HDFSdirectory>
   a. Example configurations

activeScan.dataset.baseDir=hdfs://node:8020/home/loomuser/loomInput    ACCEPTABLE
activeScan.dataset.baseDir=/home/loomuser/loomInput                    ACCEPTABLE
activeScan.dataset.baseDir=loomInput                                   ACCEPTABLE, if loomuser has a configured working directory
3. By default, Loom is set to scan the specified directory every 60 minutes, but you can change this:
loom-x.y.z-distribution/config/loom.properties
activeScan.dataset.scanIntervalMinutes=<integer>
4. You can also determine the size of the sample Loom will scan from each file, in terms of either number of rows (activeScan.hdfs.parseLines) or number of bytes (activeScan.hdfs.maxBufferSize). Loom will stop scanning as soon as it reaches one of those limits.

# The number of records to parse from a file in HDFS to determine
# whether it's a potential source.
activeScan.hdfs.parseLines=50
# The maximum amount of data to read into memory from an HDFS file to
# determine whether it's a potential source.
activeScan.hdfs.maxBufferSize=8388608
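Putting the ActiveScan settings together, a loom-x.y.z-distribution/config/loom.properties fragment might look like the following (the path and interval are illustrative):

activeScan.dataset.enabled=true
activeScan.dataset.baseDir=/home/loomuser/loomInput
activeScan.dataset.scanIntervalMinutes=30
activeScan.hdfs.parseLines=50
activeScan.hdfs.maxBufferSize=8388608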
5. Once configuration changes have been made, start or restart the Loom server. Changes will not
take effect otherwise.
Custom Metadata Properties
IMPORTANT: Read this if you are restarting Loom and using custom metadata properties. If you meet both of the following conditions: 1) you used the Custom Metadata feature of Loom, i.e. removed, edited, or added properties to the CSV(s) in loom-x.y.z-distribution/schema, and 2) you are planning to restore the registry which you previously backed up, then you must copy the contents of loom-x.y.z-distribution/schema from the old Loom distribution directory into the new directory. Otherwise, Loom will not be able to restore your registry due to a mismatch in registry structure.

loomuser@node:~/loom/loom-a.b.c-distribution$ rm schema/*
loomuser@node:~/loom/loom-a.b.c-distribution$ cp ../loom-x.y.z-distribution/schema/* schema/

Upon startup, Loom looks for CSVs in the directory loom-x.y.z-distribution/schema and reads the properties defined therein. In order to remove, edit, or create properties for a given class of entities, you will need to edit the CSVs in the loom-x.y.z-distribution/schema directory.
● All CSVs must follow the naming format meta-*.csv. For example: meta-user-extension.csv, meta-customproperties.csv.
● Each CSV must use the following schema:
type
    Description: The entity type that the property is associated with.
    Examples: Must be one of: source/SourceExtension, dataset/DatasetExtension, process/ProcessExtension, job/JobExtension

meta.attribute/name
    Description: The unique name of the property.
    Examples: geo/Location, finance/Rate

meta.attribute/valueType
    Description: The datatype for the property.
    Examples: string, long, uri, uuid

meta.attribute/cardinality
    Description: Indicates whether the property refers to a single value or a list.
    Examples: Must be 'one' or 'many'

meta.attribute.ref/type
    Description: Only use this property if meta.attribute/valueType is set to ‘uuid’; otherwise leave null. Indicates the type of entity to which the property refers.
    Examples: dataset/Dataset

meta.attribute/unique
    Description: ‘value’ or ‘identity’ indicates that this property uniquely identifies the entity; that is, no two entities can share the same value for this property.
    Examples: Must be null, 'value', or 'identity'

meta.attribute/index
    Description: Indicates that this property should be indexed for fast lookups.
    Examples: Must be 'TRUE' or 'FALSE'

meta.attribute/fulltext
    Description: Only use this property if meta.attribute/valueType is set to ‘string’; otherwise leave null. Indicates whether the value should be indexed for text searches. [Note: Support for this feature is not included in Loom 1.1.3.]
    Examples: Must be 'TRUE' or 'FALSE'

meta.attribute/doc
    Description: The label that will be displayed in the Lab Bench; a text string describing the property.
    Examples: Owned By
An example of correctly formatted custom properties for a Source, Dataset, Process, and Job:

loom-x.y.z-distribution/schema/meta-user-extension.csv
#Column Headers
type,meta.attribute/name,meta.attribute/valueType,meta.attribute/cardinality,meta.attribute.ref/type,meta.attribute/unique,meta.attribute/index,meta.attribute/fulltext,meta.attribute/doc
#Custom Property for a Source
source/SourceExtension,source.extension/hasSink,boolean,one,,,FALSE,FALSE,Has Sink?
#Custom Property for a Dataset
dataset/DatasetExtension,poc.docs/externalDocumentation,uri,many,,,FALSE,FALSE,External Documentation
#Custom Property for a Process
process/ProcessExtension,process.extension/requestedBy,string,one,,,FALSE,FALSE,Requested By
#Custom Property for a Job
job/JobExtension,job.extension/optimized,boolean,one,,,FALSE,FALSE,Optimized

For additional reading, see the README file in the Loom distribution folder (loom-x.y.z-distribution/README.txt).