Charlie Harvey

Making a secure Elasticsearch on OpenShift with data from MySQL

As part of my role at New Internationalist I have a monthly reading day. This month I was researching Elasticsearch. This post summarizes the functional prototype I built to learn a little more about the technology. It also gave me a chance to have a play with Red Hat’s OpenShift PaaS, which was a rather pleasant experience.

Elasticsearch

Elasticsearch is a distributed, scalable search engine with a REST API, written in Java and using Lucene. It is free software. In this post I will do four things:

  1. Set up an Elasticsearch instance on OpenShift
  2. Secure the Elasticsearch instance as much as is possible on the free OpenShift PaaS, that is, stop unauthenticated users from writing to the database. Big caveat: this is done with Basic Auth and not over SSL. SSL can be used, but not on the free OpenShift platform.
  3. Import data from a MySQL database into Elasticsearch
  4. Make a quick demo of Elasticsearch, using jQuery to make an interactive search box

Part 1: Setting up an Elasticsearch instance on OpenShift

This process was documented for an older version of OpenShift on the OpenShift blog, and there was also a YouTube video available, again for an older version.

Setting up OpenShift

To start off with, I set up an OpenShift account. That is straightforward and I won’t bother documenting the process.

Once I had done that I made a "Do It Yourself 0.1" app, by clicking the DIY app item at the bottom right of the add application page. I called it elastic, and was asked to choose a namespace — charlieharvey seemed appropriate.
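
If you prefer the command line, the rhc client tool can create the same app. A rough equivalent, assuming you have rhc installed and have already run rhc setup:

    $ rhc app create elastic diy-0.1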

On the settings page, I uploaded my public key by clicking "add a new key" and cutting and pasting the text from ~/.ssh/id_rsa.pub.

I got confused about where to find the git clone link, but eventually found it on the right hand side of the application’s page. I copied it and got rid of the file path and the leading ssh:// so I had something of the form somerandomstring@elastic-charlieharvey.rhcloud.com. At this stage I could ssh in, and got a nice console warning.

    $ ssh 5423e5f@elastic-charlieharvey.rhcloud.com
    Enter passphrase for key '/home/charlie/.ssh/id_rsa':
    *********************************************************************
    You are accessing a service that is for use only by authorized users.
    If you do not have authorization, discontinue use at once. Any use of
    the services is subject to the applicable terms of the agreement which
    can be found at: https://www.openshift.com/legal
    *********************************************************************

    Welcome to OpenShift shell

    This shell will assist you in managing OpenShift applications.

    !!! IMPORTANT !!! IMPORTANT !!! IMPORTANT !!!
    Shell access is quite powerful and it is possible for you to
    accidentally damage your application. Proceed with care!
    If worse comes to worst, destroy your application with "rhc app delete"
    and recreate it
    !!! IMPORTANT !!! IMPORTANT !!! IMPORTANT !!!

    Type "help" for more info.

    [elastic-charlieharvey.rhcloud.com 5423e5f]\>

Setting up Elasticsearch

So the next step was to get Elasticsearch running. OpenShift uses various environment variables for config, and one of these — $OPENSHIFT_DATA_DIR — is the place to put your Elasticsearch instance. I went to that directory, grabbed a tarball from the Elasticsearch site, untarred it, moved the files into a more sensible place and cleaned up after myself. Which went like this:

    cd $OPENSHIFT_DATA_DIR
    wget https://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-1.3.2.tar.gz
    tar xvzf elasticsearch-1.3.2.tar.gz
    rm elasticsearch-1.3.2.tar.gz
    mv elasticsearch-1.3.2/* .
    rm -r elasticsearch-1.3.2

Now I needed to do some config. I opened config/elasticsearch.yml with vim and prepended the following basic setup to the commented-out file:

    network.host: ${OPENSHIFT_DIY_IP}
    transport.tcp.port: 8090
    http.port: ${OPENSHIFT_DIY_PORT}
    discovery.zen.ping.multicast.enabled: false
    discovery.zen.ping.unicast.hosts: []

I checked that Elasticsearch wasn’t running and started it in foreground mode with:

    $ cd $OPENSHIFT_DATA_DIR
    $ bin/elasticsearch

It started fine, so I killed it with a ctrl-c. A warning had popped up, which went away after doing:

    $ execstack -c /var/lib/openshift/5423e5f45973caaf2700010e/app-root/data/lib/sigar/libsigar-x86-linux.so
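
Incidentally, a quick way to check that the instance is answering requests from inside the gear is the standard cluster health endpoint; something like:

    $ curl "http://$OPENSHIFT_DIY_IP:$OPENSHIFT_DIY_PORT/_cluster/health?pretty"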

In order to make the app start and stop as expected, OpenShift uses hooks. I next edited the start and stop hooks, at ./app-root/runtime/repo/.openshift/action_hooks/start and ./app-root/runtime/repo/.openshift/action_hooks/stop. Start became:

    nohup $OPENSHIFT_DATA_DIR/bin/elasticsearch -d

Stop was a little more complex:

    if [ -z "$(ps -ef | grep elasticsearch | grep -v grep)" ]
    then
        client_result "application is already stopped"
    else
        kill `ps -ef | grep elasticsearch | grep -v grep | awk '{ print $2 }'` > /dev/null 2>&1
    fi

Now I had a working Elasticsearch instance, which I could verify by visiting my app root in my browser, where I saw:

    {
      "status" : 200,
      "name" : "Justin Hammer",
      "version" : {
        "number" : "1.3.2",
        "build_hash" : "dee175dbe2f254f3f26992f5d7591939aaefd12f",
        "build_timestamp" : "2014-08-13T14:29:30Z",
        "build_snapshot" : false,
        "lucene_version" : "4.9"
      },
      "tagline" : "You Know, for Search"
    }

After restarting the app a couple of times on the OpenShift console, I added some data with curl:

    $ curl -XPUT "http://elastic-charlieharvey.rhcloud.com/movies/movie/1" -d'
    {
        "title": "The Godfather",
        "director": "Francis Ford Coppola",
        "year": 1972
    }'
    {"_index":"movies","_type":"movie","_id":"2","_version":2,"created":true}

I found the Elasticsearch API pretty straightforward and well documented.
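
Reading a document back is just as easy; a quick check along these lines, with ?pretty only there to format the response:

    $ curl -XGET "http://elastic-charlieharvey.rhcloud.com/movies/movie/1?pretty"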

Part 2: Making it (sort of) secure

By default, Elasticsearch is insecure: anybody can write to any index. You can secure it using a reverse proxy like nginx, along the lines sketched below, but running nginx sounded like a faff, so I looked at two plugins that offer better security. My initial favourite, the Elasticsearch security plugin, lets you lock things down with various degrees of granularity. Unfortunately, it didn’t seem to be playing nicely with the OpenShift environment.
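
As an aside, the reverse proxy approach would look something like this in nginx. This is a sketch I didn’t run; the port, paths and realm name are made up, and the idea is simply that reads pass through while writes need Basic Auth:

    server {
        listen 8080;

        location / {
            # Reads pass straight through; anything else needs credentials
            limit_except GET HEAD OPTIONS {
                auth_basic           "Elasticsearch";
                auth_basic_user_file /etc/nginx/es.htpasswd;
            }
            proxy_pass http://127.0.0.1:9200;
        }
    }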

So I gave up and moved on to elasticsearch-jetty, which replaces the default HTTP transport with Jetty. Because Jetty handles the HTTP transport, the plugin can serve SSL connections and can be configured to do Basic Authentication.

The plugin system in Elasticsearch is pretty nice, at least for installing things. I just did:

    $ cd $OPENSHIFT_DATA_DIR
    $ bin/plugin -url https://oss-es-plugins.s3.amazonaws.com/elasticsearch-jetty/elasticsearch-jetty-1.2.1.zip -install elasticsearch-jetty-1.2.1

And was rewarded with:

    -> Installing elasticsearch-jetty-1.2.1...
    Trying https://oss-es-plugins.s3.amazonaws.com/elasticsearch-jetty/elasticsearch-jetty-1.2.1.zip...
    Downloading .......................................................................................................................................DONE
    Installed elasticsearch-jetty-1.2.1 into /var/lib/openshift/5423e5/app-root/data/plugins/jetty-1.2.1

Configuring elasticsearch-jetty

My Elasticsearch instance couldn’t initially find the XML files with which elasticsearch-jetty is configured, so I moved them up a level:

    $ cd $OPENSHIFT_DATA_DIR
    $ cp config/jetty-1.2.1/*xml config

Now I needed to tell Elasticsearch to use the new plugin, so I fired up vim on config/elasticsearch.yml and added:

    script.disable_dynamic: true
    http.type: com.sonian.elasticsearch.http.jetty.JettyHttpServerTransportModule
    sonian.elasticsearch.http.jetty:
      config: jetty.xml,jetty-hash-auth.xml,jetty-restrict-writes.xml
      bind_host: ${OPENSHIFT_DIY_IP}

You will note that this is not SSL, so malicious users could read the passwords as they went over the wire — I’m looking at you, GCHQ! The plugin does let you do SSL; it just needs a slightly different config, as described in the documentation. Note also that I had to tell it to bind to the correct IP. This is probably a quirk of OpenShift; it would try to bind to all interfaces by default.
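
For reference, my understanding from the docs is that the SSL variant mostly amounts to adding the plugin’s jetty-ssl.xml to the config list. I didn’t run this on OpenShift, so treat it as an untested sketch:

    # Assumption: jetty-ssl.xml ships with the plugin and holds the keystore
    # details; check the elasticsearch-jetty docs for your version
    http.type: com.sonian.elasticsearch.http.jetty.JettyHttpServerTransportModule
    sonian.elasticsearch.http.jetty:
      config: jetty.xml,jetty-ssl.xml,jetty-hash-auth.xml,jetty-restrict-writes.xml
      bind_host: ${OPENSHIFT_DIY_IP}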

Another security note: there is a reference in the docs to being able to md5 your password, but that wasn’t working for me, so I just used plain text, which is also insecure. The password is added to a text file called config/realm.properties, thus:

    root: secretpassword,admin,readwrite

After restarting the app, I could still do a search, but not do a PUT:

    $ curl -XGET "http://elastic-charlieharvey.rhcloud.com/movies/movie/_search"
    {"took":1,"timed_out":false,"_shards":{"total":5,"successful":5,"failed":0},"hits":{"total":2,"max_score":1.0,"hits":[{"_index":"movies","_type":"movie","_id":"eSnwyk7IS5ygSRGXPfqhEQ","_score":1.0,"_source":{ "title": "The Godfather", "director": "Francis Ford Coppola", "year": 1972 }},{"_index":"movies","_type":"movie","_id":"0NPUHFp2R3q9IhvtyRpZgQ","_score":1.0,"_source":{ "title": "A.N.Other film","director": "unknown","year":"2014"}}]}}

    $ curl -XPUT "http://elastic-charlieharvey.rhcloud.com/movies/movie/2" -d'
    { "title": "The Godfather", "director": "Francis Ford Coppola", "year": 1972 }'

However, if I passed some auth details to curl, I could add new movies. Result.

    $ curl -XPUT --user root:secretpassword "http://elastic-charlieharvey.rhcloud.com/movies/movie/5" -d '{ "title": "The Godfather", "director": "Francis Ford Coppola", "year": 1972 }'
    {"_index":"movies","_type":"movie","_id":"5","_version":1,"created":true}

Part 3: Populating Elasticsearch with data from MySQL

At this stage, I had an Elasticsearch instance that was secure from being written to, but still usable for searching. The next step was populating it with data. I read a little about the bulk API and decided I would do the import all in one go, pulling the data from the MySQL database behind this very website.

I had briefly considered using the charlieharvey content API, but I thought the MySQL approach would be more easily generalized to other tasks. Like any good perl programmer, I started out by looking at CPAN, and there is a module that did most of what I needed bar a couple of regexen: DBIx::JSON. PHP users could follow this tutorial on how to get JSON from MySQL in PHP; for the really dedicated, you can also get JSON out of MySQL directly, with no programming language involved. Cool.

I installed DBIx::JSON with a quick:

    $ sudo cpanm DBIx::JSON

What I was after wasn’t quite the plain vanilla JSON of the data, which might have looked like this:

    { "title": "The Godfather", "director": "Francis Ford Coppola", "year": 1972 },
    { "title": "A.N.Other film","director": "unknown","year":"2014"},
    …

I needed it interspersed with Elasticsearch bulk action lines (one JSON document per line, no commas between them), like this:

    {"create" : {"_index":"movies","_type":"movie"}}
    { "title": "The Godfather", "director": "Francis Ford Coppola", "year": 1972 }
    {"create" : {"_index":"movies","_type":"movie"}}
    { "title": "A.N.Other film","director": "unknown","year":"2014"}

Here is the final perl script that I used to pull the titles and links of my blog pages out of MySQL and into Elasticsearch bulk format. Feel free to tweak as needed.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use feature 'say';   # the script uses say, which needs enabling
    use DBIx::JSON;

    my $dbname     = 'mydbname';
    my $host       = 'mydbhost';
    my $port       = '3306';
    my $dbusername = 'charlieharvey';

    my $dsn = "dbname=$dbname;host=$host;port=$port";
    my $obj = DBIx::JSON->new($dsn, "mysql", $dbusername, 'dbpassword');
    $obj->do_select("select id,title,CONCAT('http://charlieharvey.org.uk/page/',`slug`) AS link from pages;","id",1);
    $obj->err && die $obj->errstr;

    my $json = $obj->get_json;
    # Strip the enclosing [ and ], then turn each comma between rows into a
    # newline plus a bulk "create" action line
    $json =~ s/^\[//;
    $json =~ s/\]$//;
    $json =~ s/},/}\n{"create" : {"_index":"pages","_type":"page"}}\n/g;

    say '{"create" : {"_index":"pages","_type":"page"}}' . "\n$json";

I ran that and piped it into a file:

    $ perl charlieharvey.org.uk_to_elasticsearch.pl > pages.json

Then I could use Elasticsearch’s bulk import functionality, and curl’s ability to read from a file, to populate my Elasticsearch from that file thus:

    $ curl -XPUT --user root:pwd "http://elastic-charlieharvey.rhcloud.com/pages/_bulk" --data-binary @pages.json
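
As a quick sanity check that the documents arrived, the _count endpoint reports how many documents an index holds:

    $ curl -XGET "http://elastic-charlieharvey.rhcloud.com/pages/_count?pretty"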

Part 4: The demo

So now I have a cloud-hosted Elasticsearch instance running on OpenShift, with a bunch of data in it. All that remains is to wrap it in a little jQuery and make a search box. The code is in the body of the page. It’s basic, but have a play. Hint: Perl is a thing to search for that brings back some results.
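
The gist of it is just an AJAX GET against the _search endpoint. Here is a minimal sketch of the same idea; the element IDs and the use of the q query parameter are my choices for illustration, not necessarily what the page itself does:

    // Assumes an <input id="q"> and a <ul id="results"> on the page
    $('#q').on('keyup', function () {
      var term = $(this).val();
      if (!term) { return; }
      $.getJSON(
        'http://elastic-charlieharvey.rhcloud.com/pages/_search',
        { q: 'title:' + term },  // Lucene query string via the q parameter
        function (data) {
          var items = $.map(data.hits.hits, function (hit) {
            return '<li><a href="' + hit._source.link + '">'
                 + hit._source.title + '</a></li>';
          });
          $('#results').html(items.join(''));
        }
      );
    });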

Thanks for reading this far, and have fun playing with Elasticsearch and OpenShift.

