Compile Hadoop and Spark on a little-endian (ppc64le) system

A. Set up Hadoop:
Prerequisite: jdk1.7.0_79 pre-installed
hadoop-2.7.1 is pre-installed in ~/tmp/hadoop-2.7.1
Reference:
http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/
Steps:
1.Sun Java 6: skip
2.Adding a dedicated Hadoop system user: skip
3.Configuring SSH: skip
4.Disabling IPv6:
open /etc/sysctl.conf in the editor of your choice and add the following
lines to the end of the file:
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
You have to reboot your machine in order to make the changes take
effect.
You can check whether IPv6 is enabled on your machine with the
following command:
cat /proc/sys/net/ipv6/conf/all/disable_ipv6
A return value of 0 means IPv6 is enabled, a value of 1 means disabled
(that’s what we want).
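If a reboot is inconvenient, the same settings can usually be applied immediately with sysctl and then re-checked (standard sysctl usage):
sudo sysctl -p
cat /proc/sys/net/ipv6/conf/all/disable_ipv6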
5.Hadoop Installation:skip
6.Update $HOME/.bashrc:
Add the following lines to the end of the $HOME/.bashrc file:
# Set Hadoop-related environment variables
export HADOOP_HOME=$HOME/tmp/hadoop-2.7.1
# Set JAVA_HOME (we will also configure JAVA_HOME directly for Hadoop later on)
export JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.79-2.5.5.2.rs1.ppc64le/jre
# Some convenient aliases and functions for running Hadoop-related commands
unalias fs &> /dev/null
alias fs="hadoop fs"
unalias hls &> /dev/null
alias hls="fs -ls"
# If you have LZO compression enabled in your Hadoop cluster and
# compress job outputs with LZOP (not covered in this tutorial):
# Conveniently inspect an LZOP compressed file from the command line; run via:
#
# $ lzohead /hdfs/path/to/lzop/compressed/file.lzo
#
# Requires installed 'lzop' command.
#
lzohead () {
    hadoop fs -cat $1 | lzop -dc | head -1000 | less
}
# Add Hadoop bin/ directory to PATH
export PATH=$PATH:$HADOOP_HOME/bin
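To pick the new variables up in the current shell and check that the hadoop command resolves (a simple sanity check, not part of the reference tutorial):
source ~/.bashrc
hadoop version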
7.hadoop-env.sh:
Open hadoop-2.7.1/etc/hadoop/hadoop-env.sh in the editor. Change:
# The java implementation to use. Required.
export JAVA_HOME=/usr/lib/j2sdk1.5-sun
To:
# The java implementation to use. Required.
export JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.79-2.5.5.2.rs1.ppc64le/jre
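The exact OpenJDK directory name can differ between machines; a quick way to check what is actually installed before editing hadoop-env.sh:
ls /usr/lib/jvm/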
8.*-site.xml:
Add the following snippets between the <configuration> ... </configuration> tags in the respective configuration XML file.
In file hadoop-2.7.1/etc/hadoop/core-site.xml:
<property>
<name>hadoop.tmp.dir</name>
<value>/app/hadoop/tmp</value>
<description>A base for other temporary directories.</description>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:54310</value>
<description>The name of the default file system. A URI whose
scheme and authority determine the FileSystem implementation. The
uri's scheme determines the config property (fs.SCHEME.impl) naming
the FileSystem implementation class. The uri's authority is used to
determine the host, port, etc. for a filesystem.</description>
</property>
In file hadoop-2.7.1/etc/hadoop/mapred-site.xml (if this file does not exist yet, copy it from mapred-site.xml.template in the same directory):
<property>
<name>mapred.job.tracker</name>
<value>localhost:54311</value>
<description>The host and port that the MapReduce job tracker runs
at. If "local", then jobs are run in-process as a single map
and reduce task.
</description>
</property>
In file hadoop-2.7.1/etc/hadoop/hdfs-site.xml:
<property>
<name>dfs.replication</name>
<value>1</value>
<description>Default block replication.
The actual number of replications can be specified when the file is
created.
The default is used if replication is not specified in create time.
</description>
</property>
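The hadoop.tmp.dir directory configured above must exist before the daemons are started; a minimal preparation step (assuming these notes' root user, so no chown is needed):
mkdir -p /app/hadoop/tmp
chmod 750 /app/hadoop/tmp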
9.Formatting the HDFS filesystem via the NameNode:
The first step to starting up your Hadoop installation is formatting the
Hadoop filesystem which is implemented on top of the local filesystem of
your “cluster” (which includes only your local machine if you followed
this tutorial). You need to do this the first time you set up a Hadoop
cluster.
Do not format a running Hadoop filesystem as you will lose all the data
currently in the cluster (in HDFS)!
To format the filesystem (which simply initializes the directory specified
by the dfs.name.dir variable), run the command
~/tmp/hadoop-2.7.1/bin/hadoop namenode -format
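In Hadoop 2.x the hadoop namenode command is deprecated in favour of the hdfs script; the equivalent invocation is:
~/tmp/hadoop-2.7.1/bin/hdfs namenode -format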
10.Starting your single-node cluster:
Run the command:
~/tmp/hadoop-2.7.1/sbin/start-all.sh
A nifty tool for checking whether the expected Hadoop processes are
running is jps (needs to be installed).
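On this single-node setup, jps should report roughly the following daemons once start-all.sh has finished (process names only; PIDs will differ):
jps
# expected: NameNode, DataNode, SecondaryNameNode, ResourceManager, NodeManager, Jps
The NameNode web interface is normally reachable at http://localhost:50070/.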
11.Stopping your single-node cluster:
Run the command
~/tmp/hadoop-2.7.1/sbin/stop-all.sh
to stop all the daemons running on your machine.
12.Running a MapReduce job:
Download example input data:
mkdir /tmp/gutenberg
cd /tmp/gutenberg
wget http://www.gutenberg.org/etext/5000
Restart the Hadoop cluster:
~/tmp/hadoop-2.7.1/sbin/start-all.sh
Copy local example data to HDFS:
./bin/hadoop dfs -mkdir /user
./bin/hadoop dfs -mkdir /user/hduser
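The directories above still need the input itself; a hedged way to copy it into HDFS and confirm that it arrived (standard dfs options; adjust the local path if the download went elsewhere):
./bin/hadoop dfs -copyFromLocal /tmp/gutenberg /user/hduser/gutenberg
./bin/hadoop dfs -ls /user/hduser/gutenberg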
13.Run the MapReduce job:
Now we actually run the WordCount example job (before doing this, check that the jar and the HDFS input exist):
./bin/hadoop jar hadoop*examples*.jar wordcount /user/hduser/gutenberg /user/hduser/gutenberg-output
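In the hadoop-2.7.1 layout the examples jar actually lives under share/hadoop/mapreduce, so a concrete invocation, plus a way to read the result, looks roughly like this (jar name assumed from the 2.7.1 distribution):
./bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.1.jar wordcount /user/hduser/gutenberg /user/hduser/gutenberg-output
./bin/hadoop dfs -cat /user/hduser/gutenberg-output/part-r-00000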
End
B. Set up Spark:
Prerequisite: jdk1.7.0_79 pre-installed
Reference:
http://blog.prabeeshk.com/blog/2014/10/31/install-apache-spark-on-ubuntu-14-dot-04/
Steps:
1.install Java:skip
2.check the Java installation is successful:skip
3.install Scala: (this step may not be needed)
cd ~/tmp
wget http://www.scala-lang.org/files/archive/scala-2.10.4.tgz
sudo mkdir /usr/local/src/scala
sudo tar xvf scala-2.10.4.tgz -C /usr/local/src/scala/
vi ~/.bashrc
And add following in the end of the file
export SCALA_HOME=/usr/local/src/scala/scala-2.10.4
export PATH=$SCALA_HOME/bin:$PATH
Reload .bashrc:
source ~/.bashrc
To check that Scala is installed successfully:
scala -version
It should show the installed Scala version:
Scala code runner version 2.10.4 -- Copyright 2002-2013, LAMP/EPFL
4.download the pre-built spark for hadoop:
Since building was not successful, I downloaded a pre-built one instead.
wget http://d3kbcqa49mib13.cloudfront.net/spark-1.1.1-bin-hadoop2.4.tgz
tar xvf spark*tgz
cd ~/tmp/spark-1.1.1-bin-hadoop2.4
5.run Spark interactively:
./bin/spark-shell
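The master can also be set explicitly, for example to use two local cores (a standard spark-shell option):
./bin/spark-shell --master local[2]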
6.run MQTT interactively: skip
7.build source package:
~/bin/sbt clean
8.set the SPARK_HADOOP_VERSION variable:skip
9.read and write data into cdh4.3.0 clusters:skip
End
Test Spark:
Examples can be found at http://spark.apache.org/docs/latest/quick-start.html
Basics:
Start it by running the following in the Spark directory:
./bin/spark-shell
make a new RDD from the text of the README file in the Spark source
directory:
scala> val textFile = sc.textFile("README.md")
textFile: spark.RDD[String] = spark.MappedRDD@2ee9b6e3
start with a few actions:
scala> textFile.count() // Number of items in this RDD
res0: Long = 126
scala> textFile.first() // First item in this RDD
res1: String = # Apache Spark
use the filter transformation to return a new RDD with a subset of the
items in the file.
scala> val linesWithSpark = textFile.filter(line => line.contains("Spark"))
linesWithSpark: spark.RDD[String] = spark.FilteredRDD@7dd4af09
We can chain together transformations and actions:
scala> textFile.filter(line => line.contains("Spark")).count() // How
many lines contain "Spark"?
res3: Long = 15
More on RDD Operations:
find the line with the most words:
scala> textFile.map(line => line.split(" ").size).reduce((a, b) => if (a > b)
a else b)
res4: Long = 15
use Math.max() function to make this code easier to understand:
scala> import java.lang.Math
import java.lang.Math
scala> textFile.map(line => line.split(" ").size).reduce((a, b) =>
Math.max(a, b))
res5: Int = 15
implement MapReduce flows:
scala> val wordCounts = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b)
wordCounts: spark.RDD[(String, Int)] = spark.ShuffledAggregatedRDD@71f027b8
To collect the word counts in our shell, we can use the collect action:
scala> wordCounts.collect()
res6: Array[(String, Int)] = Array((means,1), (under,2), (this,3),
(Because,1), (Python,2), (agree,1), (cluster.,1), ...)
Caching:
mark our linesWithSpark dataset to be cached:
scala> linesWithSpark.cache()
res7: spark.RDD[String] = spark.FilteredRDD@17e51082
scala> linesWithSpark.count()
res8: Long = 15
scala> linesWithSpark.count()
res9: Long = 15
run SparkPi example:
cd ~/tmp/spark-1.1.1-bin-hadoop2.4/bin
./run-example SparkPi 10
Self-Contained Applications:
1.install sbt:
download at : http://www.scala-sbt.org/
Install reference :
http://www.scala-sbt.org/0.13/tutorial/Manual-Installation.html
cd ~/bin
wget https://repo.typesafe.com/typesafe/ivy-releases/org.scala-sbt/sbt-launch/0.13.8/sbt-launch.jar
vi sbt
write in file:
#!/bin/bash
SBT_OPTS="-Xms512M -Xmx1536M -Xss1M -XX:+CMSClassUnloadingEnabled -XX:MaxPermSize=256M"
java $SBT_OPTS -jar `dirname $0`/sbt-launch.jar "$@"
chmod u+x ~/bin/sbt
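A quick smoke test of the launcher (the first run downloads sbt's own dependencies; about is a built-in sbt task, assuming the ~/bin/sbt path used above):
~/bin/sbt about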
2.write program:
cd ~/tmp/spark-1.1.1-bin-hadoop2.4
mkdir project
cd project
vi simple.sbt
write in file:
name := "Simple Project
-Xss1M
version := "1.0"
scalaVersion := "2.10.4"
libraryDependencies += "org.apache.spark" %% "spark-core" %
"1.4.1""
mkdir src
cd src
mkdir main
cd main
mkdir scala
cd scala
vi SimpleApp.scala
write in file:
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

object SimpleApp {
  def main(args: Array[String]) {
    val logFile = "/root/tmp/spark-1.1.1-bin-hadoop2.4/README.md" // Should be some file on your system
    val conf = new SparkConf().setAppName("Simple Application")
    val sc = new SparkContext(conf)
    val logData = sc.textFile(logFile, 2).cache()
    val numAs = logData.filter(line => line.contains("a")).count()
    val numBs = logData.filter(line => line.contains("b")).count()
    println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
  }
}
3.make package:
cd ../../../../
~/bin/sbt package
4.Submit:
../bin/spark-submit --class "SimpleApp" --master local[4] target/scala-2.10/simple-project_2.10-1.0.jar
run wordCount with eclipse:
1.build environment
mkdir ~/da
cd ~/da
vi README.md
write anything in file
vi submit.sh
write in file:
#!/bin/bash
path=/root/tmp/spark-1.1.1-bin-hadoop2.4/
libs=$path/lib
$path/bin/spark-submit \
--class "SimpleApp.SimpleApp" \
--master local[4] \
--jars $libs/datanucleus-api-jdo-3.2.1.jar,$libs/datanucleus-core-3.2.2.jar,$libs/spark-examples-1.1.1-hadoop2.4.0.jar,$libs/datanucleus-rdbms-3.2.1.jar,$libs/spark-assembly-1.1.1-hadoop2.4.0.jar \
/root/da/SimpleApp.jar
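The script must also be executable before step 3 below runs it (alternatively, invoke it with bash submit.sh):
chmod u+x ~/da/submit.sh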
2.write program in eclipse
in eclipse-scala IDE:
file>new>scala project "SimpleApp"
SimpleApp>src>new>package "SimpleApp"
SimpleApp>src>SimpleApp>new>scala object "SimpleApp"
write in file:
package SimpleApp

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

object SimpleApp {
  def main(args: Array[String]) {
    val logFile = "README.md" // Should be some file on your system
    val conf = new SparkConf().setAppName("Simple Application")
    val sc = new SparkContext(conf)
    val textFile = sc.textFile(logFile)
    val counts = textFile.flatMap(line => line.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _).count()
    println("number of words: %s".format(counts))
  }
}
project>properties>java build path>libraries>add external jars
select all files in /root/tmp/spark-1.1.1-bin-hadoop2.4/lib
file>save
file>export>jar file with name "SimpleApp.jar"
put SimpleApp.jar in ~/da
3.Run:
cd ~/da
./submit.sh
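Spark's default log4j configuration writes its own logging to stderr, so the application's println line can be isolated roughly like this (a sketch, assuming the default logging setup):
./submit.sh 2>/dev/null | grep "number of words"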
run character Count:
in eclipse-scala IDE:
in SimpleApp>src>SimpleApp>SimpleApp.scala:
Replace
val counts = textFile.flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _).count()
With
val counts = textFile.flatMap(line => line.split(" "))
  .map(word => word.length)
  .sum
file>save all
file>export>jar file with name "SimpleApp.jar"
put SimpleApp.jar in ~/da
cd ~/da
./submit.sh