OTP dev-ops

feb 25

OTP dev-ops

Working with SASL logs is a bit different than what dev-ops may be used to. Instead of employing the typical strategies like tail -f and grep, OTP gives us some specific tools for looking into those pesky Erlang errors.

First, the log files generated by SASL aren’t plain-text files. So, think twice before you try to tail -f them at all; that’s not going to get you anywhere. Rather, you better go and read the manual for Erlang’s report browser application, ‘rb.’

Report browsing

First, let’s assume our OTP system is configured to log five MB of rotating log-files. Our system configuration file — ‘sys.config’ — can accommodate this by having the following:

[
  {sasl, [
    {sasl_error_logger, false},
    {error_logger_mf_dir, "/var/log/your-app"},
    {error_logger_mf_maxbytes, 5242880}, % 5 mb
    {error_logger_mf_maxfiles, 5}
  ]}
].

After our OTP system has chugged along for a couple days, we want to peer into the logs and see if any wires have been tripped. One way to ease this process is by automating a few things. In the directory that contains the rotating logs, I have a ‘Makefile’ in there so I can quickly start investigating. Here is said ‘Makefile’ :

.PHONY: default
default:
    erl -name wtf@your-app.com \
        -boot start_sasl \
        -eval "rb:start([{report_dir, \"/var/log/your-app\"}])"

After I cd to the log directory & type make, I am thrown into an Erlang shell with the report browser all set to go.

‘rb’ comes with some great tools. And when you become familiar with it, it feels similar to piping Unix commands around the shell when investigating logged errors in other systems.

Let’s look at a few of these ‘rb’ goodies. I’ll explain how they work & give you a sense for what to expect. The following is ordered by how often I use them.

Listing

Using rb:list() is the first command I run after firing up the shell. I usually look at the date and get a sense for the breadth of errors. Here is a sample of what you can expect:

15        crash_report        foo_srv 2016-02-15 11:36:13
14   supervisor_report    <7219.45.0> 2016-02-15 11:36:13
13            progress    <7219.45.0> 2016-02-15 11:36:13
12               error    <7219.45.0> 2016-02-17 00:47:23
11        crash_report        foo_srv 2016-02-17 00:47:23
10   supervisor_report    <7219.45.0> 2016-02-17 00:47:23
 9            progress    <7219.45.0> 2016-02-17 00:47:23
 8               error    <7219.45.0> 2016-02-17 19:31:28
 7        crash_report        foo_srv 2016-02-17 19:31:28
 6   supervisor_report    <7219.45.0> 2016-02-17 19:31:28
 5            progress    <7219.45.0> 2016-02-17 19:31:28
 4               error    <7219.45.0> 2016-02-19 11:19:37
 3        crash_report        foo_srv 2016-02-19 11:19:37
 2   supervisor_report    <7219.45.0> 2016-02-19 11:19:37
 1            progress    <7219.45.0> 2016-02-19 11:19:37

You can see that there are some errors, the last being from the 19th of February. If you have quite a lot of space between errors, you can gather on the report-type by using rb:list(error). Of course, the parameter can be one of the many types of reports, not just ‘error.’

Reading

So, the browser shows errors. Let’s take a look at the latest one by passing ‘4’ to the ‘show’ routine; ‘4’ in this case is the ID of the report. This is what rb:show(4) yields:

…

** Generic server foo_srv terminating 
** Last message in was {do_timed_work,infinity}
** When Server state == {state,11,0}
** Reason for termination == 
** {{badmatch,{error,socket_closed_remotely}},
    [{foo_srv,do_timed_work,1,
      [{file,"src/foo_srv.erl"},{line,143}]},{foo_srv,handle_info,2,[{file,"src/foo_srv.erl"},{line,102}]},
     {gen_server,handle_msg,5,[{file,"gen_server.erl"},{line,604}]},
     {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,239}]}]}
ok

Without too much trouble, we can see that this error happened because of {badmatch,{error,socket_closed_remotely}}; easy. Now, how about the error before this? Let’s take a look via rb:show(8):

…

** Generic server foo_srv terminating 
** Last message in was {do_timed_work,infinity}
** When Server state == {state,16,0}
** Reason for termination == 
** {{badmatch,{struct,[{status,500},
                       {error,internal_server_error}]}},
    [{foo_srv,do_timed_work,1,
                    [{file,"src/foo_srv.erl"},{line,144}]},
     {foo_srv,handle_info,2,[{file,"src/foo_srv.erl"},{line,102}]},
     {gen_server,handle_msg,5,[{file,"gen_server.erl"},{line,604}]},
     {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,239}]}]}
ok

This time we notice it’s an error we don’t care about: this happens once in a blue moon and doesn’t hinder business — we put it on the long-list.

You may be wondering: how can we peruse the reports that solely contain {{badmatch,{error,socket_closed_remotely}}?

Grep’n’

If we want to see all the reports containing a certain word, then we can use the report browser’s ‘grep’ routine. For example, let’s drill down into the reports which contain {{badarg,{error,socket_closed_remotely}} errors:

rb:grep("socket_closed_remotely").

This will give us what we need. All the reports that match on this ‘string’ will now be at our fingertips.

Rescanning

In a system that is hot with logged information (perhaps the koan of ‘let it crash’ has taken your system hostage for a spell) we mustn’t forget that our log-browsing session gets old. To give the session a shake so we can see the latest, rb:rescan/0,1 will do the trick. You will notice some familiar output in your shell when you run it, from when you initially started the report browser.

Nota bene

Some things to keep in mind: First, rb:grep/1 is a little janky. Regular expressions that seem to work with Erlang’s ‘re’ module — not to be confused with Erlang’s rb module — simply don’t resolve as they should in rb:grep/1. You have been warned.

Secondly, there is no way to daisy-chain searches, like you can with tail -f in combination with Unix ‘grep’ — though this shouldn’t be surprising considering we aren’t dealing with OOP here.

Finally, a big red flag needs to be waved for the rb:filter/1,2 family of routines. ‘Filter’ is to be used only when employing error_logger:info_report/1,2, error_logger:warning_report/1,2 & error_logger:error_report/1,2. Also, since one can freely pass strings & arbitrary terms as the arguments, it should be observed that ‘filter’ is only useful in the case when we pass proplists, like the following:

error_logger:error_report([{foo, bar}, {baz, luhrmann}]).

You can see that now we have keys which we can match/filter on, making rb:filter/1,2 actually useful. If you don’t code your errors in this manner, you may not come around to ever needing the ‘filter’ routines.