To run a container with a shared folder (e.g. ~/Desktop/localFolder), listening on port 8888. localFolder is located on the desktop and you can use it to share files with the virtual machine, IPython notebooks included:
docker run -d -p 8888:8888 -v ~/Desktop/localFolder/:/notebooks --name pyspark stravanni/cineca-spark
-d
daemon mode
-p
port
-v
volume
--name
give a name to the container
Open the browser at localhost:8888
If you are on a Mac, remember that the actual VirtualBox IP can be found with boot2docker ip.
If instead you want to connect via localhost,
you need the following port forwarding for VBox
(e.g. for ports 8880 to 8890):
for i in {8880..8890}; do
VBoxManage modifyvm "boot2docker-vm" --natpf1 "tcp-port$i,tcp,,$i,,$i";
VBoxManage modifyvm "boot2docker-vm" --natpf1 "udp-port$i,udp,,$i,,$i";
done
To get info about the virtual machine where the containers run:
boot2docker info
To change the memory of the virtual machine (i.e. the VBox VM):
VBoxManage modifyvm boot2docker-vm --memory 4096
docker ps
shows active containers
docker ps -a
shows all containers
docker restart CONTAINER-ID
restarts a container
docker stop $(docker ps -aq)
stops all containers
docker rm $(docker ps -aq)
removes all containers
When the container is launched, the first command issued is:
IPYTHON_OPTS="notebook --no-browser --ip=0.0.0.0 --port 8888" /usr/local/spark/bin/pyspark
The IPython notebook will already have the SparkContext variable sc.
Run sc.version
to see which version is loaded.
To read a file directly from the local disk (no HDFS), specify the scheme explicitly:
sc.textFile("file:///absolute_path to the file/")
SparkContext.textFile internally calls org.apache.hadoop.mapred.FileInputFormat.getSplits, which in turn uses org.apache.hadoop.fs.getDefaultUri if the scheme is absent. This method reads the "fs.defaultFS" parameter of the Hadoop conf.
If the HADOOP_CONF_DIR environment variable is set, the parameter is usually set to "hdfs://..."; otherwise to "file://".
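The scheme-resolution behaviour described above can be sketched in plain Python. This is a simplified illustration of the idea, not Hadoop's actual code; the function name resolve_path and the default_fs values are assumptions for the example:

```python
from urllib.parse import urlparse

def resolve_path(path, default_fs="file:///"):
    """Loosely mimic how Hadoop picks a filesystem for a path:
    if the path already carries a scheme (hdfs://, file://, ...),
    keep it as-is; otherwise fall back to the scheme of fs.defaultFS."""
    if urlparse(path).scheme:
        return path
    default_scheme = urlparse(default_fs).scheme
    return "%s://%s" % (default_scheme, path)

# A path with an explicit scheme is left untouched:
resolve_path("file:///data/input.txt")
# A bare path inherits the scheme of the default filesystem,
# e.g. "hdfs://..." when HADOOP_CONF_DIR points at a cluster conf:
resolve_path("/data/input.txt", default_fs="hdfs://namenode:9000")
```

This is why prefixing the path with file:// in sc.textFile forces a local read even when the Hadoop configuration defaults to HDFS.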