Elasticsearch in the Cloud

Set up Elasticsearch and Kibana on Amazon EMR

Install the AWS CLI (the command below uses its aws emr subcommand).
Run the command below to create a cluster whose bootstrap actions:
- install Elasticsearch and Kibana on Amazon EMR using Amazon EMR's bootstrap action feature.
- install Maven to compile Java code.
- The bootstrap scripts run on all nodes in the new cluster.
- They run before Hadoop is configured or started and before the nodes start processing data.
- After this, Elasticsearch is listening on port 9200 and Kibana is listening on port 80. By default, these ports are protected from public access.
- To access these interfaces on the master node, do one of the following: open an SSH tunnel to the master node and browse through it, or add a rule to allow incoming TCP traffic on port 80 and port 9200 on the EC2 Security Group “ElasticMapReduce-master”.
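
The two access options in the last item could look roughly like the sketch below. It only prints the commands so that you can review them before running anything. The local port 8080, the key file name, and the /32 source address are assumptions made for illustration. hadoop is the usual login user on EMR nodes, and --group-name only resolves in a default VPC (otherwise --group-id would be needed).

```shell
#!/usr/bin/env bash
# Sketch: print (do not run) the commands for the two access options.

# Option 1: SSH tunnel. Forwards local port 9200 to Elasticsearch and
# local port 8080 (an arbitrary free port, assumed) to Kibana on port 80.
build_tunnel_cmd() {
  local key_file="$1" master_dns="$2"
  echo "ssh -i $key_file -N -L 9200:localhost:9200 -L 8080:localhost:80 hadoop@$master_dns"
}

# Option 2: security group rule. Opens one TCP port to a single
# source address rather than to everyone.
build_ingress_cmd() {
  local port="$1" source_ip="$2"
  echo "aws ec2 authorize-security-group-ingress --group-name ElasticMapReduce-master --protocol tcp --port $port --cidr $source_ip/32"
}

# Placeholder values; substitute your own before running the printed commands.
build_tunnel_cmd "<YOUR_EC2_KEYNAME>.pem" "<MASTER_PUBLIC_DNS>"
build_ingress_cmd 80 "<YOUR_IP>"
build_ingress_cmd 9200 "<YOUR_IP>"
```

With the tunnel open, Kibana would be reachable at http://localhost:8080 and Elasticsearch at http://localhost:9200 on your own machine.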

aws emr create-cluster --ec2-attributes KeyName="<YOUR_EC2_KEYNAME>" \
--log-uri="<YOUR_LOGGING_BUCKET>" \
--bootstrap-actions \
Name="Install Maven",Path=s3://support.elasticmapreduce/bootstrap-actions/other/maven-install.sh \
Name="Install ElasticSearch",Path="s3://beta.elasticmapreduce/bigdatablog/elasticsearch_install.rb" \
Name="Installkibanaginx",Path="s3://beta.elasticmapreduce/bigdatablog/kibananginx_install.rb" \
--ami-version=3.2.1 \
--instance-count=3 \
--instance-type=m1.medium \
--name="TestElasticSearch" \
--use-default-roles \
--no-auto-terminate
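
create-cluster prints the id of the new cluster. A minimal sketch for waiting until the cluster is running and then looking up the master node's public DNS name is shown below. The run helper and the DRY_RUN variable are hypothetical conveniences for previewing the commands, and the cluster id is a placeholder.

```shell
#!/usr/bin/env bash
# Sketch: wait for the cluster, then print the master node's public DNS name.

# Print a command instead of executing it when DRY_RUN=1 (hypothetical helper).
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

master_dns() {
  local cluster_id="$1"
  # Blocks until the cluster reaches the RUNNING state.
  run aws emr wait cluster-running --cluster-id "$cluster_id" &&
  # Extracts a single field from the cluster description.
  run aws emr describe-cluster --cluster-id "$cluster_id" \
    --query 'Cluster.MasterPublicDnsName' --output text
}

# Preview the commands without calling AWS:
DRY_RUN=1
master_dns "<YOUR_CLUSTER_ID>"
```

Setting DRY_RUN=0 makes the same function call AWS for real, which assumes configured credentials and a valid cluster id.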

Run a big data job

Cascading

Cascading is an application development platform for building data applications on Apache Hadoop. You can use it to build an application that indexes JSON files in Elasticsearch without having to think in terms of MapReduce methods.

//SSH to the master node and run the following:

$ sudo yum install git
$ git clone https://github.com/awslabs/aws-big-data-blog.git
$ cd aws-big-data-blog/aws-blog-elasticsearch-cascading-commoncrawl/commoncrawl.cascading.elasticsearch
$ mvn clean && mvn assembly:assembly -Dmaven.test.skip=true  -Ddescriptor=./src/main/assembly/job.xml -e

Once these steps are done, the compiled application is placed in the folder: aws-big-data-blog/aws-blog-elasticsearch-cascading-commoncrawl/commoncrawl.cascading.elasticsearch/target
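
Before submitting the job it can be worth confirming that the build actually produced the job jar. The sketch below assumes the jar name ends in -job.jar, as in the hadoop command in this section; find_job_jar is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: locate the assembled job jar in a Maven target directory.

# Prints the path of the first *-job.jar found, or returns non-zero
# with a message on stderr if the build did not produce one.
find_job_jar() {
  local target_dir="$1" jar
  for jar in "$target_dir"/*-job.jar; do
    if [ -f "$jar" ]; then
      echo "$jar"
      return 0
    fi
  done
  echo "no job jar found in $target_dir" >&2
  return 1
}

# Usage on the master node (path taken from the build steps above):
# find_job_jar /home/hadoop/aws-big-data-blog/aws-blog-elasticsearch-cascading-commoncrawl/commoncrawl.cascading.elasticsearch/target
```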

//use hadoop to index a single Common Crawl metadata (WAT) file
hadoop jar /home/hadoop/aws-big-data-blog/aws-blog-elasticsearch-cascading-commoncrawl/commoncrawl.cascading.elasticsearch/target/commoncrawl.cascading.elasticsearch-0.0.1-SNAPSHOT-job.jar com.amazonaws.bigdatablog.indexcommoncrawl.Main s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2014-52/segments/1419447563504.69/wat/CC-MAIN-20141224185923-00099-ip-10-231-17-201.ec2.internal.warc.wat.gz

The application writes each JSON entry directly into Elasticsearch using the Cascading and Hadoop connectors.
After it runs, you can check the indices and mappings in Elasticsearch.

//list all indices:
curl 'localhost:9200/_cat/indices?v'

//view the mappings:
curl -XGET 'http://localhost:9200/_all/_mapping' | python -m json.tool | more
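
To see at a glance whether the job indexed anything, the index listing can be reduced to index names and document counts. This is a sketch that assumes the ?v header row contains columns named index and docs.count and that no row has empty fields; index_doc_counts is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: summarize `_cat/indices?v` output as "<index> <document count>".

# Reads the listing on stdin and finds the columns by their header names,
# so the result does not depend on the column order.
index_doc_counts() {
  awk 'NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }
       { print $(col["index"]), $(col["docs.count"]) }'
}

# Usage on the master node (needs a running Elasticsearch):
# curl -s 'localhost:9200/_cat/indices?v' | index_doc_counts
```

A document count of 0 for the new index would suggest that the Hadoop job did not write any entries.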

You can then use Kibana to query the indexed content.
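
The indexed content can also be queried from the command line through Elasticsearch's URI search. The sketch below only builds the URL. The search term amazon is an arbitrary example, search_url is a hypothetical helper name, and the term is assumed to need no URL encoding.

```shell
#!/usr/bin/env bash
# Sketch: build a URI-search URL for a simple one-word query.

# $1 = search term (letters and digits only, assumed)
# $2 = maximum number of hits to return (defaults to 5)
search_url() {
  local term="$1" size="${2:-5}"
  echo "http://localhost:9200/_search?q=${term}&size=${size}&pretty"
}

# Usage on the master node (needs a running Elasticsearch):
# curl -s "$(search_url amazon 3)"
```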

Reference

http://blogs.aws.amazon.com/bigdata/post/TxC0CXZ3RPPK7O/Indexing-Common-Crawl-Metadata-on-Amazon-EMR-Using-Cascading-and-Elasticsearch