
Improving the development experience: Jupyter for Elasticsearch

I was asked to extract data from Elasticsearch, but I had no clue what kind of data was stored in it. On top of that, documentation was lacking, and the only thing that could help me was a handful of old Python scripts written by a former colleague.

Fortunately, Python is pretty damn good when it comes to readability.

So, since I am working on improving my team's development experience and have been messing around with Jupyter, I decided to create a Jupyter notebook with a super simple Elasticsearch client in Python 3 that supports connecting through a proxy (vital if you work in a company).
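To give an idea of what "a client that goes through a proxy" means in practice, here is a minimal sketch that uses only the Python 3 standard library. It is not the notebook's actual client: the function names and the hosts `es.example.internal` and `proxy.example.internal` are made up for illustration, and it assumes an Elasticsearch instance reachable over plain HTTP without authentication.

```python
import json
import urllib.request


def build_opener_with_proxy(proxy_url):
    """Return a urllib opener that sends HTTP and HTTPS traffic via proxy_url."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


def build_search_request(es_url, index, query):
    """Build a POST request against the index's _search endpoint."""
    body = json.dumps(query).encode("utf-8")
    return urllib.request.Request(
        url=f"{es_url.rstrip('/')}/{index}/_search",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def search(es_url, index, query, proxy_url):
    """Run the query through the proxy and return the decoded JSON response."""
    opener = build_opener_with_proxy(proxy_url)
    request = build_search_request(es_url, index, query)
    with opener.open(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))


# Hypothetical usage (needs a reachable cluster and proxy):
# result = search(
#     "http://es.example.internal:9200",
#     "logs",
#     {"query": {"match_all": {}}},
#     proxy_url="http://proxy.example.internal:3128",
# )
```

Keeping the request building separate from the network call makes it easy to inspect, in a notebook cell, what is about to be sent before actually sending it.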

Advantages for my team

Increased visibility and knowledge sharing about Elasticsearch and how to handle its data quickly and efficiently. With Jupyter and this notebook, my teammates can connect immediately to our internal Elasticsearch, extract data, and play with cool Python libraries for data analysis such as pandas and NumPy, and for machine learning such as TensorFlow. Before, data analysis in my team was handled by a single person, who became the big data guy (and then quit). Now the whole team is up and running and can do the same task quickly.
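As an illustration of the "extract data and play with pandas" step, here is a small sketch that flattens an Elasticsearch search response into a list of plain rows. The helper name `hits_to_rows` and the sample documents are invented for the example. The only thing it relies on is the usual shape of a search response, where the documents sit under `hits.hits` and each one carries its fields in `_source`.

```python
def hits_to_rows(response):
    """Flatten a search response into one dict per document."""
    rows = []
    for hit in response.get("hits", {}).get("hits", []):
        # Keep the document id and index next to the document's own fields.
        row = {"_id": hit.get("_id"), "_index": hit.get("_index")}
        row.update(hit.get("_source", {}))
        rows.append(row)
    return rows


# Made-up response, trimmed to the parts the helper reads.
sample_response = {
    "hits": {
        "hits": [
            {"_index": "logs", "_id": "1", "_source": {"status": 200, "path": "/"}},
            {"_index": "logs", "_id": "2", "_source": {"status": 404, "path": "/x"}},
        ]
    }
}

rows = hits_to_rows(sample_response)
```

Since `rows` is a list of dictionaries, `pandas.DataFrame(rows)` turns it directly into a table with one column per field, which is a convenient starting point for the analysis in the notebook.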

Also, with Jupyter they can run code and see the result in the same view, step by step, for every line they write, which dramatically reduces the time spent developing the Python scripts we need for data analysis in the team. Before, it was trial and error until it worked.

Another great advantage is that we can now include code and text in the same notebook, export it as Markdown, and store it in our internal Bitbucket (we don’t have the notebook viewer plugin). These notebooks will become incredibly efficient and effective documentation that helps the team improve its data analysis skills.
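The Markdown export can also be scripted instead of done by hand from the Jupyter menu. The sketch below is one possible way, assuming `nbconvert` is installed alongside Jupyter. The helper names, the `docs` folder and the notebook file name are hypothetical.

```python
import subprocess
from pathlib import Path


def nbconvert_command(notebook, output_dir):
    """Build the `jupyter nbconvert` command that exports a notebook to Markdown."""
    return [
        "jupyter", "nbconvert",
        "--to", "markdown",
        "--output-dir", str(output_dir),
        str(notebook),
    ]


def export_notebook(notebook, output_dir="docs"):
    """Run the export and return the path of the Markdown file it should produce."""
    subprocess.run(nbconvert_command(notebook, output_dir), check=True)
    return Path(output_dir) / (Path(notebook).stem + ".md")


# Hypothetical usage, before committing the result to the repository:
# export_notebook("elasticsearch_client.ipynb")
```

The resulting `.md` file renders in the Bitbucket web view like any other Markdown file, so the missing notebook viewer plugin stops being a problem.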

Next steps

Jupyter is great. My teammates were blown away when they saw it and how it can improve our development workflow for data analysis tasks. The next step is to organize workshops or coding dojos to practice with pandas, NumPy, and TensorFlow. It would also be cool to install JupyterHub so that we have a common environment in which to share notebooks. Fortunately, there is a Docker image for it:

docker run -d --name jupyterhub jupyterhub/jupyterhub jupyterhub

(The source for the proxy support is the usual suspect: Stack Overflow.)

The Jupyter notebook is on my GitHub, HERE.

Here is the Python script: