OTP dev-ops
Working with SASL logs is a bit different than what dev-ops may be
used to. Instead of employing the typical strategies like
tail -f
and grep
, OTP gives us some specific tools for looking
into those pesky Erlang errors.
First, the log files generated by SASL aren’t plain-text files.
So, think twice before you try to tail -f
them at all; that’s not going
to get you anywhere. Rather, you better go and
read the manual
for Erlang’s report browser application, ‘rb.’
Report browsing
First, let’s assume our OTP system is configured to log five MB of rotating log-files. Our system configuration file — ‘sys.config’ — can accommodate this by having the following:
[
{sasl, [
{sasl_error_logger, false},
{error_logger_mf_dir, "/var/log/your-app"},
{error_logger_mf_maxbytes, 5242880}, % 5 mb
{error_logger_mf_maxfiles, 5}
]}
].
After our OTP system has chugged along for a couple days, we want to peer into the logs and see if any wires have been tripped. One way to ease this process is by automating a few things. In the directory that contains the rotating logs, I have a ‘Makefile’ in there so I can quickly start investigating. Here is said ‘Makefile’ :
.PHONY: default
default:
erl -name wtf@your-app.com \
-boot start_sasl \
-eval "rb:start([{report_dir, \"/var/log/your-app\"}])"
After I cd
to the log directory & type make
, I am thrown into
an Erlang shell with the report browser all set to go.
‘rb’ comes with some great tools. And when you become familiar with it, it feels similar to piping Unix commands around the shell when investigating logged errors in other systems.
Let’s look at a few of these ‘rb’ goodies. I’ll explain how they work & give you a sense for what to expect. The following is ordered by how often I use them.
Listing
Using rb:list()
is the first command I run after firing up the shell.
I usually look at the date and get a sense for the breadth of errors.
Here is a sample of what you can expect:
15 crash_report foo_srv 2016-02-15 11:36:13
14 supervisor_report <7219.45.0> 2016-02-15 11:36:13
13 progress <7219.45.0> 2016-02-15 11:36:13
12 error <7219.45.0> 2016-02-17 00:47:23
11 crash_report foo_srv 2016-02-17 00:47:23
10 supervisor_report <7219.45.0> 2016-02-17 00:47:23
9 progress <7219.45.0> 2016-02-17 00:47:23
8 error <7219.45.0> 2016-02-17 19:31:28
7 crash_report foo_srv 2016-02-17 19:31:28
6 supervisor_report <7219.45.0> 2016-02-17 19:31:28
5 progress <7219.45.0> 2016-02-17 19:31:28
4 error <7219.45.0> 2016-02-19 11:19:37
3 crash_report foo_srv 2016-02-19 11:19:37
2 supervisor_report <7219.45.0> 2016-02-19 11:19:37
1 progress <7219.45.0> 2016-02-19 11:19:37
You can see that there are some errors, the last being from the 19th
of February. If you have quite a lot of space between errors, you can
gather on the report-type by using rb:list(error)
. Of course, the
parameter can be one of the many types of reports, not just ‘error.’
Reading
So, the browser shows errors. Let’s take a look at the latest one by passing ‘4’
to the ‘show’ routine; ‘4’ in this case is the ID of the report. This is
what rb:show(4)
yields:
…
** Generic server foo_srv terminating
** Last message in was {do_timed_work,infinity}
** When Server state == {state,11,0}
** Reason for termination ==
** {{badmatch,{error,socket_closed_remotely}},
[{foo_srv,do_timed_work,1,
[{file,"src/foo_srv.erl"},{line,143}]},{foo_srv,handle_info,2,[{file,"src/foo_srv.erl"},{line,102}]},
{gen_server,handle_msg,5,[{file,"gen_server.erl"},{line,604}]},
{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,239}]}]}
ok
Without too much trouble, we can see that this error happened because
of {badmatch,{error,socket_closed_remotely}}
; easy. Now, how about the
error before this? Let’s take a look via rb:show(8)
:
…
** Generic server foo_srv terminating
** Last message in was {do_timed_work,infinity}
** When Server state == {state,16,0}
** Reason for termination ==
** {{badmatch,{struct,[{status,500},
{error,internal_server_error}]}},
[{foo_srv,do_timed_work,1,
[{file,"src/foo_srv.erl"},{line,144}]},
{foo_srv,handle_info,2,[{file,"src/foo_srv.erl"},{line,102}]},
{gen_server,handle_msg,5,[{file,"gen_server.erl"},{line,604}]},
{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,239}]}]}
ok
This time we notice it’s an error we don’t care about: this happens once in a blue moon and doesn’t hinder business — we put it on the long-list.
You may be wondering: how can we peruse the reports that solely
contain {{badmatch,{error,socket_closed_remotely}}
?
Grep’n’
If we want to see all the reports containing a certain word, then we can
use the report browser’s ‘grep’ routine. For example, let’s drill down
into the reports which contain
{{badarg,{error,socket_closed_remotely}}
errors:
rb:grep("socket_closed_remotely").
This will give us what we need. All the reports that match on this ‘string’ will now be at our fingertips.
Rescanning
In a system that is hot with logged information (perhaps the koan
of ‘let it crash’ has taken your system hostage for a spell)
we mustn’t forget that our log-browsing session gets old. To give
the session a shake so we can see the latest, rb:rescan/0,1
will do
the trick. You will notice some familiar output in your shell when you
run it, from when you initially started the report browser.
Nota bene
Some things to keep in mind:
First, rb:grep/1
is a little janky. Regular
expressions that seem to work with
Erlang’s ‘re’ module — not to be
confused with Erlang’s rb module — simply don’t resolve as they
should in rb:grep/1
. You have been warned.
Secondly, there is no
way to daisy-chain searches, like you can with tail -f
in
combination with Unix ‘grep’ — though
this shouldn’t be surprising considering we aren’t dealing with
OOP here.
Finally, a big red flag needs to be waved for the rb:filter/1,2
family of routines. ‘Filter’
is to be used only when employing
error_logger:info_report/1,2
, error_logger:warning_report/1,2
& error_logger:error_report/1,2
. Also, since one can freely pass
strings & arbitrary terms as the arguments, it should be observed that
‘filter’ is only useful in the case when we pass proplists, like
the following:
error_logger:error_report([{foo, bar}, {baz, luhrmann}]).
You can see that now we have keys which we can match/filter on, making
rb:filter/1,2
actually useful. If you don’t code your errors in this
manner, you may not come around to ever needing the ‘filter’ routines.