Report on Prometheus Casual Talks in Tokyo and then toward PromCon 2016

by LINE Engineer on 2016.8.2


Introduction

Hello, my name is Wataru Yukawa. I work at LINE as a data engineer.

As a data engineer, my daily duties include collecting logs with Fluentd, storing them in Hadoop, and aggregating and analyzing them with Hive. Our Hadoop cluster is medium-sized, consisting of 40 units and approximately 370TB of DFS used space. The LINE family apps produce less data than the LINE app itself. While it’s nowhere near large enough to be considered big data, it still involves many different types of data, many Fluentd tags, and over 400 Fluentd processes because of the various LINE family services tied to it. The Fluentd data flow reaches 150 thousand messages per second at peak times.

Monitoring Hadoop and Fluentd is crucial to us, but the monitoring tools available to us were less than ideal, which prompted us to look for better solutions. That’s when I learned that the storage development team for the LINE app uses Prometheus and Grafana for monitoring. Prometheus is a next generation monitoring system with a pull-based architecture and a powerful query language. I was so impressed with what Prometheus was capable of that I repeatedly listened to one of its developers speaking on a podcast about its fundamentals.

In the end, I decided to incorporate Prometheus into my working environment. Seeing how many teams were doing the same, I thought it would be a good idea to gather the people using Prometheus at a meetup where we could share information with each other. This is how we began Prometheus Casual Talks, the first Prometheus meetup event in Japan. We still have to improve speaker diversity, as 4 out of 5 speakers were LINE employees, but I think that ratio is a testament to how popular Prometheus is inside the company.

In this blog post, I’d like to share my thoughts from the Prometheus meetup and then briefly talk about the upcoming PromCon 2016.

Prometheus Casual Talks #1

The first ever Prometheus meetup in Japan was held in our office cafe on June 14, 2016. The title was “Prometheus Casual Talks #1.” I didn’t expect many participants to show up, as Prometheus was not yet that prominent in Japan. But to my surprise, we had about 100 people in attendance. Including myself, there were 5 speakers at the event. Here are brief summaries of the talks.

Hadoop, Fluentd cluster monitoring with Prometheus and Grafana

First up was my own talk, “Hadoop, Fluentd cluster monitoring with Prometheus and Grafana.” In the first half of my session, I gave a brief introduction to Prometheus. I focused on the pros and cons of Prometheus as a pull-based system, something rare in monitoring software. I also talked about its powerful query features. The latter half focused on how we actually use it in the company.

We created custom exporters for Hadoop, Fluentd, and jstat, and we use Prometheus and Grafana to monitor them. As I mentioned during my talk, creating exporters isn’t difficult, and I recommend that everyone try it out.
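To give a sense of how little an exporter needs, here is a minimal sketch in Python that uses only the standard library. It is not one of the exporters described above: the metric names, the collect() function, and port 9101 are all made up for illustration. An exporter is essentially an HTTP endpoint that returns the current values in Prometheus's text exposition format whenever the Prometheus server scrapes it.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def render_metrics(samples):
    """Render (name, labels, value) samples in the Prometheus text format.

    Label values are not escaped here; a real exporter must escape
    backslashes, double quotes, and newlines inside label values.
    """
    lines = []
    for name, labels, value in samples:
        if labels:
            label_text = ",".join(
                f'{key}="{val}"' for key, val in sorted(labels.items())
            )
            lines.append(f"{name}{{{label_text}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


def collect():
    """Hypothetical collector. A real exporter would query the monitored
    process (Fluentd, Hadoop, jstat, ...) here on every scrape."""
    return [
        ("example_buffer_queue_length", {"plugin": "forward"}, 3),
        ("example_up", {}, 1),
    ]


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Prometheus scrapes /metrics by default.
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(collect()).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=9101):
    """Start the exporter; the port number is an arbitrary choice."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Calling serve() and pointing a Prometheus scrape job at that port would be enough to see the two sample metrics. In practice, an official client library handles the formatting and escaping for you.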

I was invited as a speaker to the first ever Prometheus conference called PromCon. More details on PromCon later on in the post.

Prometheus on AWS

Next up was “Prometheus on AWS” by mtanda. His talk was about how he uses Prometheus and Grafana to monitor services on AWS.

The life cycles of instances on AWS tend to be short and the host info can change at any time. That is why mtanda emphasized the importance of service discovery in order to alleviate the cumbersome part of using Prometheus caused by its pull-based architecture. In his environment, mtanda seems to use ec2_sd_configs to discover servers and monitor them.

As a Grafana committer, mtanda happens to be the original author of the Grafana plugin necessary to integrate Prometheus with Grafana. Without it, Prometheus only has a basic web UI, which is why many engineers commonly use Grafana for a dashboard. Honestly, if Grafana didn’t support Prometheus, I would have avoided using Prometheus. Grafana is a great dashboard tool and I believe that many Prometheus users feel the same way.

Promgen – Prometheus management tool

The third talk was “Promgen – Prometheus management tool” by tokuhirom, another LINE colleague.

Unlike mtanda, we use an on-premise environment instead of a cloud-based one, which means the host info in our environment does not change often. This is why we don’t use service discovery tools such as Consul. Instead, we need to list the servers that the Prometheus server scrapes in files referenced by file_sd_configs. As this would be cumbersome to do manually, tokuhirom created Promgen: a Ruby web application used with Prometheus that lets us add hosts to be monitored from the browser. When asked why he decided to write it in Ruby, tokuhirom simply said: “The weather was nice.” I also had plans to create a similar web application but decided to use the one tokuhirom wrote since he created it at lightning speed.
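For readers unfamiliar with file-based service discovery, the following Python sketch shows the kind of JSON file that Prometheus reads through file_sd_configs: a list of target groups, each with "targets" and optional "labels". This is not Promgen's implementation. The function names, the "project" label, and the host names are assumptions made for illustration.

```python
import json
import os
import tempfile


def build_target_groups(hosts_by_project, port):
    """Build file_sd target groups, one group per (hypothetical) project."""
    groups = []
    for project, hosts in sorted(hosts_by_project.items()):
        groups.append({
            "targets": [f"{host}:{port}" for host in sorted(hosts)],
            "labels": {"project": project},
        })
    return groups


def write_targets_file(path, groups):
    """Write the file atomically so Prometheus never sees a partial file.

    The temporary file ends in .tmp, so a file_sd glob such as *.json
    does not pick it up before it is renamed into place.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(groups, f, indent=2)
    os.replace(tmp_path, path)
```

Prometheus watches the files listed under file_sd_configs and reloads them when they change, so a management tool only has to rewrite the file when someone adds or removes a host.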

Promgen also offers alerting capabilities. Engineers can register their own alert rules in the browser. For example, the rule below would send an alert when the HiveServer2 Java heap exceeds 5GB.

ALERT Hiveserver2HeapCheck
  IF jstat_oldUsed{farm="...",instance="...:9010",job="jstat",project="..."} >= 5000000
  FOR 1m
  ANNOTATIONS {description="HiveServer2 of {{ $labels.instance }} uses java heap more than 5GB"}
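The FOR 1m clause means the alert does not fire the moment the threshold is crossed. It stays pending until the condition has held for a full minute across consecutive rule evaluations. The Python sketch below is a simplified model of that behavior, not Prometheus code, and the sample values and 30-second evaluation spacing are invented.

```python
def alert_states(samples, threshold, for_seconds):
    """Return the alert state after each rule evaluation.

    samples: time-ordered (timestamp_seconds, value) pairs, one per
    evaluation. The alert is "pending" while the condition holds but
    has not yet held for for_seconds, and "firing" after that. Any
    evaluation where the condition is false resets it to "inactive".
    """
    states = []
    active_since = None
    for timestamp, value in samples:
        if value >= threshold:
            if active_since is None:
                active_since = timestamp
            held_for = timestamp - active_since
            states.append("firing" if held_for >= for_seconds else "pending")
        else:
            active_since = None
            states.append("inactive")
    return states


# Invented samples, evaluated every 30 seconds against the rule's threshold.
samples = [
    (0, 4_000_000),
    (30, 5_200_000),
    (60, 5_300_000),
    (90, 5_400_000),
    (120, 4_900_000),
]
states = alert_states(samples, threshold=5_000_000, for_seconds=60)
```

With these samples the alert becomes pending at 30 seconds, fires at 90 seconds, and returns to inactive at 120 seconds, so a brief spike shorter than one minute would never send a notification.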

Promgen also has a webhook. It can send alert notifications to HipChat and email.

LINE recently revealed Promgen to the public as an open source project. Check this link for more details.

Monitoring Kafka with Prometheus

The fourth talk was by kawamuray, another one of my colleagues in charge of LINE storage development.

We use Prometheus and Grafana to monitor Kafka, the middleware that stores important service data from the LINE app. We check the Grafana dashboard when a problem is detected, and use PromQL to track each individual metric. In his talk, kawamuray explained how he uses jmx_exporter to monitor Kafka and how he monitors YARN jobs through Pushgateway, and he gave some tips on using Prometheus.
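Short-lived batch jobs may finish before Prometheus gets a chance to scrape them, so they push their metrics to a Pushgateway, which Prometheus then scrapes on their behalf. The Python sketch below builds such a push request with the standard library. It is not kawamuray's setup: the gateway address, job name, grouping label, and metric name are all hypothetical.

```python
import urllib.request
from urllib.parse import quote


def pushgateway_url(base_url, job, grouping=None):
    """Build the push URL: /metrics/job/<job>/<label>/<value>/...

    Values containing "/" need the Pushgateway's base64 form, which
    this sketch does not handle.
    """
    path = "/metrics/job/" + quote(job, safe="")
    for name, value in sorted((grouping or {}).items()):
        path += "/" + quote(name, safe="") + "/" + quote(value, safe="")
    return base_url.rstrip("/") + path


def build_push_request(base_url, job, metrics_text, grouping=None):
    """Build a PUT request carrying metrics in the text exposition format.

    PUT replaces every metric previously pushed for this grouping key;
    POST would replace only metrics with the same names.
    """
    return urllib.request.Request(
        pushgateway_url(base_url, job, grouping),
        data=metrics_text.encode("utf-8"),
        method="PUT",
        headers={"Content-Type": "text/plain; version=0.0.4"},
    )


# Hypothetical job reporting how long it ran.
request = build_push_request(
    "http://pushgateway.example:9091/",
    "yarn_batch",
    "job_duration_seconds 12.5\n",
    grouping={"instance": "app-1"},
)
```

Passing the request to urllib.request.urlopen at the end of the job would send it. The Pushgateway keeps the last pushed values until they are overwritten or deleted, so it suits job-level results rather than per-host health.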

His Prometheus server’s machine boasted some very impressive specifications, making everyone in the audience jealous.

Implement h2o_exporter in 5 minutes

The last talk was by moznion, another one of my colleagues. As you can see in a previous blog post written by moznion, he is the engineer in charge of LINE LIVE.

In a live coding session, moznion demonstrated how to implement an h2o_exporter in 5 minutes, a truly impressive feat. There are no slides to share as it was a live demonstration, but the completed h2o_exporter can be seen here.

If you want to know more about his work on LINE LIVE, be sure to check out this blog post.

PromCon 2016

PromCon, the first ever Prometheus conference, is scheduled to take place on August 25-26, 2016 in Berlin, Germany.

I was given the honor of speaking at the first ever Prometheus conference! I’m so excited and I can’t wait until PromCon 2016. I’ll need to brush up on my English in the meantime.

My talk at PromCon 2016 will be similar to the talk I gave during Prometheus Casual Talks #1, but without the basics of Prometheus, as most of the audience at PromCon will already be familiar with it. In place of that introduction, I plan to talk about Promgen. As a guest speaker, I am looking forward to meeting many more fellow Prometheans.

As interest in Prometheus is at an all-time high in the company, LINE Corporation is participating as a silver sponsor of the conference. I am looking forward to meeting you all there. Hope to see you all soon!