Graphing Access Log Status with PNP4Nagios, part two
The graphs I showed in part one are generated by performance output from a Perl script, check_access_log.pl (attached to this post). The performance data are then captured by pnp4nagios and saved into an RRD file. The default graphing templates provided with pnp4nagios v0.4 didn’t do justice to how interesting these data are, so I wrote my own template (to be attached to a future post).
The usage for check_access_log.pl is:
Usage: check_log3.pl [ -v|--verbose ] -l|--logfile=file -m|--m=minutes [ -c|--critical=<threshold>(20) ] [ -w|--warning=<threshold>(10) ] [ -a|--activity=number_of_lines (100) ]
where:
- logfile is the path to your httpd access log
- minutes is how far back from the present you want to parse your log file (5 minutes by default)
- -c|-w represent the percentage of non-200 statuses that is considered problematic
- --activity: only go to warning/critical if activity exceeds this number of lines per check. This prevents the check from going to warning/critical if you’ve taken the server out of service.
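To make the threshold logic concrete, here is a minimal sketch in Python rather than the Perl of the actual script. It is not the code of check_access_log.pl: the function name `evaluate`, the status-matching regex, and the choice of `>=` for the threshold comparison are all assumptions made for illustration.

```python
import re

# Assumption: lines are in common/combined log format, where the status code
# follows the closing quote of the request string.
STATUS_RE = re.compile(r'" (\d{3}) ')

def evaluate(lines, warning=10.0, critical=20.0, activity=100):
    """Return (state, percent_non_200) for one window of access-log lines."""
    statuses = [m.group(1) for m in map(STATUS_RE.search, lines) if m]
    total = len(statuses)
    non_200 = sum(1 for s in statuses if s != "200")
    pct = 100.0 * non_200 / total if total else 0.0
    # Only alert when activity exceeds the floor, so an idle or
    # out-of-service server never goes to warning/critical.
    if total <= activity:
        return "OK", pct
    if pct >= critical:
        return "CRITICAL", pct
    if pct >= warning:
        return "WARNING", pct
    return "OK", pct
```

The activity floor is checked first, so a handful of 404s on a server with almost no traffic cannot trip an alert.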
The script requires:
- File::ReadBackwards; # allows the script to read backwards until the time period criterion is met
- Date::Manip;
- Nagios::Plugin;
Generally, the live httpd access logs will not be on your Nagios server so you’ll need to invoke this over NRPE.
Graphing Access Log Status with PNP4Nagios, part one
When I started at $WORK we had little by way of a comprehensive monitoring system, and rectifying that quickly became my top priority. I rolled out Nagios and, needing a way to trend historical performance data, added pnp4nagios (http://docs.pnp4nagios.org/pnp-0.4/start) on top of that. Over time I had a real desire to see the breakdown of 200s, 403s, 404s, 500s and other status codes being logged by our various web servers. What I wanted were graphs like this (oops, no traffic on the 20th),
or something that illustrated that it was okay when our backend Day Communique CQ5.3 publisher was running 80% error rates late in the afternoon (because our front-ends had cached all the 200 OK objects, but the 404s were still getting passed to the backend; in the morning, the publications flushed the front-end cache, and most 200s would go to the backend again):

Anyhow, setting up PNP4Nagios to generate these lovely, zoomable graphs will be the focus of my next post.
Bringing Devops to my Day Job
I’ve been following the “DevOps movement” since I was first turned on to Paul Hammond and John Allspaw’s Velocity 2009 talk, “10+ Deploys Per Day: Dev and Ops Cooperation at Flickr” (velocityconference.blip.tv/file/2284377/). Many of the precepts of their talk (infrastructure automation, automated deploys, version control, and shared metrics) have been part of my approach to system administration for some years now (albeit never to the extent that Flickr embraces these matters). So much of what they say, and what I see elsewhere in DevOps from following Chef and Puppet, resonates with an approach that I’m naturally tuned to. Not so much my workplace, despite nearly two years of quiet advocacy. Now, however, my workplace has come through the latest major release with more interest in evaluating how we do things than was the case during this last year’s headlong rush to a new architecture and infrastructure. So I scheduled a lunchtime session to view and discuss Hammond and Allspaw’s talk, and was pleased to have over two dozen people show up, and another handful call in to the conference line. Further, our senior VP watched the talk while he was on travel and responded that we should take seriously the direction in which Allspaw and Hammond point. I have no doubt the cultural hurdles will be harder to vault than the technological ones. But I have loved my work the best when I’ve been at the table with the developers from inception to deployment, and seen how well that works, so I’ll take heart from the positive response to this first small step. I’ll keep you posted.
Swept away at Surge
Last week I attended Surge up in Baltimore, and for the first time in over a decade I had the exhilarating sensation of being out of my depth. Since it was a small conference with some of the big players in web-scale applications and content delivery, there’s good reason for that, but it inspires me to pick up the pace in following some technologies I’ve only been tangentially aware of.
The most interesting talks were:
- Scalable Design Patterns, Theo Schlossnagle
- Enterprise solutions from commodity components, Bryan Cantrill
- Growth at Gilt Group, Geir Magnusson, Jr.
- Why some architects almost never shard their applications, Baron Schwartz
I should find more to say about them, but the primary reason I haven’t been blogging much the last few months is that I try to take on too much writing too late in the evening, then run out of steam, so just take my recommendations for what they are.
All of the presentations are supposed to be posted within a few weeks. I need to go back and watch a few that I missed. To wit, Artur Bergman on scaling Wikia, Justin Sheehy on Concurrency, Benjamin Black on Netflow (if time) and Neil Gunther on Quantifying Scalability.
So, you wanna be a consultant
On Wednesday, March 10, Sweth Chandramouli spoke on being an independent
technology consultant at the DC-SAGE bi-monthly meeting. While the whole talk
was worthwhile, what stood out for me were his distilled principles on how to
build and find business. While he claims that he’s borrowed freely (and
largely from Malcolm Gladwell), I’ve not before seen these principles applied
to consulting. I think they’re particularly useful since sysadmin types
(myself included) can be shy about promoting themselves, but it’s easier to
overcome shyness if you know you are applying ‘correct’ methodology.
Ask for the behavior you want.
Example: Suppose I’m asked what line of work I’m in. We all know the
wrong answer is to say, “I do web engineering,” and leave it at that. It’s also wrong to
answer, “I do web engineering. Do you know anyone needing to scale out their
web presence, or fix an inadequate web infrastructure?” because “know” is a
passive state and asks nothing of your questioner. Contrast with, “I do web
engineering. Say, can you take a moment and tell me who you know that makes a
living off running a website, and may need help scaling it out, or fixing a
broken infrastructure?”
Now I’ve asked for a concrete action, a behavior, which not only may elicit an
immediate response, but is more likely to be recalled later when it
really matters (as when my listener is talking to her CTO a week later).
(On a related note, Sweth pointed out that asking for business cards is more
effective, and cheaper, than distributing them unsolicited.)
Reward the behavior (not the outcome) that you desire
Be prolific with feedback and thanks for behaviors that may benefit you, even
if the outcome of any specific behavior does not pan out. Sweth’s example
was to send thank yous for any referral, not just the ones that bring in
business. We also discussed the impact of hand-written thank-yous in an
age of email.
So, the next day I emailed Sweth to get his snail mail address so I could send
him a thank you card. In his response, he noted that I’m more likely to get
DC-SAGE speakers to show up for return gigs if I send the thank you upon
acceptance (rewarding the behavior) instead of waiting for the talk to go through
(rewarding the outcome). At which point I sent a thank-you both to Sweth and
to John Nicholson, the May speaker on Law and Technology.
Lead with the giving hand
Sweth’s third principle was to give something of value to potential clients,
but not to give away any services themselves. His example was the free “Site
Evaluations” that some companies do, but he suggested we find ways that are
not such naked sales pitches.
For me, this was a wake-up call
to return to a practice I’d made for some years of offering, nay, insisting on
giving a ‘Job Talk’ when called in for a job interview. I started doing that
after graduate school because I was used to teaching/research candidates
coming on site and making presentations, and when I was on the job market
(particularly after the Tech Bubble) I thought it was a way to stand out from
the crowd. But I also liked giving job talks because it gave me something to
do in preparation for the interview.
In retrospect, the practice served me doubly well because giving a talk on
some new technology provided something of tangible value to my potential
employer, and built a relationship where the job offer was the
logical step in reciprocation. Score.
Apache redirects with RewriteMap and RewriteCond
At work we’re migrating several thousand research articles from
Zope to another CMS. The CMS folks are taking care of moving the
content, but when we’re done we’re going to institute a boatload
of R=301 redirects from the old URLs to the new URLs.
RewriteMap is the accepted way with mod_rewrite of handling a lot
of one-to-one mappings that don’t follow any particular pattern.
What I figured out today was that I can use a rewrite map in a
RewriteCond statement so I only do the redirect when there’s a match
in the rewrite map lookup.
Here are some snippets from my httpd.conf to illustrate:
For testing, we’ll want logging:
RewriteLog /var/log/httpd/rewrite.log
RewriteLogLevel 2
Define the map we’re using. The content of the map is ‘old_uri new_uri’
with a space separating the two. Use .txt for testing, and then below we’ll
convert to a DBM.
RewriteMap research_map txt:/etc/httpd/conf/research_map.txt
Here, I got a hint from
http://www.tunnell.org/blog_posts_view.php?blog_postid=3. Remember that the
syntax for a RewriteCond is:
RewriteCond TestString ConditionPattern
TestString will be a map lookup of $1, where $1 is the match string of the
following RewriteRule, expressed: ${research_map:$1}
For ConditionPattern,
we will test if the TestString is lexically greater than
"", the empty string, which is what the map lookup returns when there’s no
match. Expressed: >""
# i.e., if the map result is greater than ""
RewriteCond ${research_map:$1} >""
So if the URL starts with /research, then
use the research_map value for the key $1 to redirect to new address
RewriteRule ^(/research/.*$) ${research_map:$1} [R=301,L]
What we end up with is a few lines of configuration that quickly let me put
in place 3342 new redirects. Here’s the whole stanza:
RewriteMap research_map txt:/etc/httpd/conf/research_map.txt
# i.e., if the map result is greater than ""
RewriteCond ${research_map:$1} >""
RewriteRule ^(/research/.*$) ${research_map:$1} [R=301,L]
# Same thing, but lookup with a trailing slash if there isn't one
RewriteCond ${research_map:$1/} >""
RewriteRule ^(/research/.*[^/]$) ${research_map:$1/} [R=301,L]
Lastly, converting the textfile to a dbm speeds up the lookup by at least an
order of magnitude.
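To spell out what the stanza decides for each request, here is a rough Python model of the lookup logic. It only illustrates the reasoning and is not how mod_rewrite is implemented; the function names and the sample map entries in the usage note are hypothetical.

```python
def load_map(text):
    """Parse 'old_uri new_uri' lines into a dict, skipping blanks and comments."""
    mapping = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 2 or parts[0].startswith("#"):
            continue
        mapping[parts[0]] = parts[1]
    return mapping

def redirect_target(path, mapping):
    """Return the new URI for a 301 redirect, or None to leave the request alone."""
    if not path.startswith("/research/"):
        return None
    # First rule: exact lookup. A miss yields "", so the condition fails.
    target = mapping.get(path, "")
    if target:
        return target
    # Second rule: retry with a trailing slash if there isn't one.
    if not path.endswith("/"):
        target = mapping.get(path + "/", "")
        if target:
            return target
    return None
```

With a map holding a line such as `/research/old/ /articles/new`, both `/research/old/` and `/research/old` resolve to `/articles/new`, while any URI missing from the map falls through untouched.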
bash command timeout
At $WORK I need to maintain a backup system wherein our backup server a) starts an SSH process to stop-and-dump our CMS service, then b) SCPs the dumpfile back to the backup server for writing to tape. I discovered that the stop-and-dump part of the process would hang for 24 hours* when the stop-and-dump Perl script exited but the initiating OpenSSH sshd process did not, preventing the SCP process from going forward.
I’ve decided to put a command timeout on the SSH process, and here’s how it looks in bash:
# Inspired by:
# http://www.ultranetsolutions.com/BASH-terminate-command-after-timeout.html

# Minimal 'die' helper, which cmd_timeout relies on
die() { echo "$*" >&2; exit 1; }

cmd_timeout() {
    [ $# -eq 2 ] || die "cmd_timeout takes 2 arguments: command, timeout in seconds"
    command=$1
    sleep_time=$2

    # Run $command in the background; $! has the pid of the backgrounded job.
    # $command is deliberately unquoted so it splits into the program and its arguments.
    $command &
    cmd_pid=$!

    # Sleep for our timeout, then kill the command if it is still running
    ( sleep "$sleep_time" && kill "$cmd_pid" && echo "ERROR - killed $command due to timeout $sleep_time exceeded" ) &
    killer_pid=$!

    # 'wait' for cmd_pid to complete. If it finishes cleanly before the timeout is
    # reached, the status will be zero. If the killer terminates it (or the command
    # itself fails), it will have a non-zero exit status.
    wait "$cmd_pid" &> /dev/null
    wait_status=$?

    if [ "$wait_status" -ne 0 ]; then
        echo "WARNING - command, $command, unclean exit"
    fi

    # Status 143 means the command died from SIGTERM, most likely sent by our killer,
    # which then finishes on its own. In every other case, detach and clean up the
    # now-useless killer so it cannot fire later against a reused pid.
    if [ "$wait_status" -ne 143 ]; then
        disown "$killer_pid" 2> /dev/null
        kill "$killer_pid" &> /dev/null
    fi
    return $wait_status
}

# The second argument is the timeout in seconds (3600 is only an example value)
cmd_timeout "ssh myhost some_long-running_command" 3600
next_command
* but I ought to raise this on an openssh mailing list in case it’s a bug, but anyho…
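For comparison, the same idea can be sketched in Python using the standard library's `subprocess.run`, which kills the child itself when the timeout expires. This is a different technique from the background-killer approach in the bash function above, and `run_with_timeout` is a hypothetical name chosen for this sketch.

```python
import subprocess

def run_with_timeout(argv, timeout_s):
    """Run argv; return its exit status, or None if it was killed on timeout."""
    try:
        # subprocess.run kills and reaps the child before re-raising TimeoutExpired
        return subprocess.run(argv, timeout=timeout_s).returncode
    except subprocess.TimeoutExpired:
        return None
```

A call such as `run_with_timeout(["ssh", "myhost", "some_long-running_command"], 3600)` would then return `None` instead of hanging, and the caller can move on to the next step.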
Upgrading to Typo 5.3 (on Dreamhost)
Upgrading from Typo 4.0.3 to Typo 5.3 on Dreamhost Rails
I’ve been slacking lately on my blogging, which is a shame because I have much
to write about. So much going on, no time to write about it. To help rectify
that, I intend to start devoting Tuesday evenings to technology blogging, and
to start things off, let’s upgrade this dumpy old Typo 4.0.3 to shiny new
Typo 5.3.
I’m on Dreamhost, and they suck, but not bad enough for me to pick up and go
elsewhere, so I guess that actually counts as good service. So here is one
set of correct steps to upgrade a Typo 4.X installation to Typo 5.3. These are
based on the Upgrading to
Typo5.3 instructions,
but differ at a few points for Dreamhost specifics, and to accommodate my own
mistakes.
- log in to the Typo web interface and uninstall all your sidebars (see note below)
- use mysqldump to dump the mysql database
mysqldump -u typo -p -h mysql.typo.pburkholder.com typo > ~/backup/typo.mysql
- use tar to copy the typo installation
tar -czf backup/typo.4.0.3.tgz ./typo.pburkholder.com/
- install the ‘typo’ gem. I’ve been using ~/.rubygems as my local gem install, so
export GEM_PATH=/usr/lib/ruby/gems:/home/pburkholder/.rubygems
gem install -i ~/.rubygems typo
- now do the install:
~/.rubygems/bin/typo install ~/typo.pburkholder.com
- move some of my old files over (the upgrade instructions suggest copying over old themes, don’t)
mkdir ~/typo.pburkholder.com/public/files/
cp typo4.pburkholder.com/public/files/* typo.pburkholder.com/public/files/
cp typo4.pburkholder.com/config/database.yml typo.pburkholder.com/config/database.yml
- run the database migration
cd ~/typo.pburkholder.com
rake RAILS_ENV=production db:migrate
rake gems:refresh_specs
(The last step may not be necessary, but worked for me)
- your old theme won’t work. Period. Simplest solution I came up with was to symlink a 5.3 theme to my old theme name. E.g., I had ‘azure’ as my theme, and ‘scribbish’ came with 5.3, so:
cd ~/typo.pburkholder.com/themes; ln -s scribbish azure
- start a single instance on port 3000 and see how things are on http://typo.pburkholder.com:3000
script/server -e production (Ctrl-C to kill)
- set up Passenger. I had been running Typo using FastCGI support and an .htaccess file that called dispatcher.fcgi, from back before Dreamhost had any real Rails support, and it was a dog. Now they support Passenger, which is promising. So:
- Go to the Dreamhost panel and enable Passenger support
- Disable FastCGI
- Go have fun!!
What if I skipped step 1?
I misread the upgrade instructions, and moved the sidebar code instead of ‘removing’ the sidebars from the web UI. My new Typo instance complained bitterly about missing sidebars. Removing the sidebars manually is just a matter of connecting to the production MySQL database, seeing what’s in the sidebars table, then deleting the rows:
select * from sidebars\G
delete from sidebars where id>1;
Now I need to replace my old sidebar links, such as:
Puppet, round one
In the last few months I’ve switched from an employer that was on their third generation of configuration management (cfengine phase one, cfengine phase two, to puppet) to an environment that has no configuration management and over 50 servers to manage. I’ve been a user of cfengine for over five years and had just started cutting my teeth on Puppet when I switched jobs so I’m finding it pretty painful to be w/o configuration management. Since my efforts to get CM green lighted have been met with “Yes but we’re too busy for the next few months,” I’m going to have to do some groundwork on my own clock.
One task I’d like to devote to a configuration management system is the distribution of the Nagios NRPE.conf file to my hosts.
Until a couple of weeks ago I would have looked only at Puppet for my CM, but now there’s Chef on the block, so I’m going to give that a gander. I’ll do a quick deployment of NRPE.conf in my home env with each system, and see how it goes.
More from OSCON 2008
WEDNESDAY
Full stack introspection
Theo Schlossnagle (OmniTI, Laurel, MD), jesus at omniti
Author of “Scalable Internet Architectures”
Whole stack introspection
- MacOSX, BSD, Solaris
- Paul Fox working on Linux port. Not in kernel, so CDDL/GPL, tension not a worry
DTrace approaches
- Start w/ User
- Sorcery (not recommended) Twiddle to fix, then verify. Starts w/ Hypotheses
- Experienced. Starts in middle
- It’s very important to know your optimum operation before you can diagnose anomalous behavior.
SystemTap on Linux is way, way behind. It requires writing code and sticking it in as a module, which is not a way to work on production boxes.
Mac OS X is very well instrumented.
Then followed a masterful demonstration of real-time system diagnosis. Obviously this is not something one learns in an afternoon, but it certainly is more fruitful (and probably no harder) than what I hack up in grep, perl, or what not to figure out what’s going on in VKS, or ps, or logfiles.
Take aways:
- DTrace for Linux may be in the offing. DTrace is awesome. Watch for it.
- I must better understand our architecture from ACE to backends.
Groovy vs. JRuby
A very good, well-organized talk. Both environments are ready to use. Groovy goes to JARs. Performance used to be 20% – 90% of pure Java, but the gap is rapidly closing. JRuby may make Java integration easier in the common cases, but Groovy handles the edge cases better. The Groovy language is still a little immature, e.g. regexes are not transitive. Groovy allows typing (which can be handy), but JRuby does clean duck-typing. JRuby has all Ruby libraries. Rails works in the J2EE stack.
CouchDB.
Freakin’ awesome. Written in Erlang. (BTW, Ericsson’s switches running Erlang advertise nine 9s of availability, about 0.03 s of downtime annually.) I’ll need to write more about it later.
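As a quick sanity check on availability figures like that one, the downtime implied by a given number of nines is simple arithmetic. The sketch below assumes a 365.25-day year; the function name is made up for illustration.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600  # 31,557,600 s, assuming a 365.25-day year

def downtime_seconds_per_year(nines):
    """Allowed downtime per year for an availability of `nines` nines."""
    # Five nines (99.999%) leaves an unavailable fraction of 10**-5, and so on
    return SECONDS_PER_YEAR * 10.0 ** (-nines)
```

Nine nines works out to roughly 0.03 seconds per year, and five nines to a little over five minutes.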
OpenSource Virtualization:
Completely lame presentation. One needs to learn that cutesy slides will emphasize, not obscure, your inexperience.
VMware ESXi will be free with a non-deprecated API, announced June 22.
VirtualBox
- Maps to VMware Workstation, Player, Fusion, Server
- Tools -> guest additions
- vmx -> xml
- has Seamless (Unity) layer
- harder: interfaces, network, disk, audio, USB.
- very similar across OS’s, Solaris Host
- Can run headless and over RDP.
Xen. Etc., but I was getting too annoyed by the talk so I caught the rest of the lightning talks and Google open source update.
Google Open Source Update
Chris DiBona and TKTK
- Melange
- Google Highly Open Participation contest == GHOP
- DROP
Bofs and Parties
System Administration best practices BOF. Nothing new learned, but maybe I made some contributions.
Met a friendly pair of brothers from Baltimore, Mark and Curt Tilmes. Mark works at UMD (five weeks of vacation a year sounds pretty sweet). We went down to the OpenSolaris party at the DoubleTree, which was too redolent of an overamped frat party. The MindTouch people are having a great party up on the top floor of the Red Lion, with a tight live band, nice atmosphere and a sweeping view of Portland across the Willamette. Too bad I’m too tired to stay here long and that it’s not better attended.
THURSDAY
New York Times intro to AWS
Derek Gottfrid
- Really needed PDFs from 1851-1981
- Dynamically generate or… pregenerate 11 million PDFs and store statically.
- Copy source to S3, store PDFs in S3 (which has been reliable except for last week)
- 4.3 TB of 20M files
- Parallelization. Into PDFs, so map not reduce.
- Non-Hadoop part: read metadata, load TIFFs, scale, arrange, generate PDF, write PDF
- Minimal use of HDFS
- Took previous code and converted it to S3 thanks to JetS3t
MapReduce
- Map takes a Key/Value pair and outputs Key/Value pairs
- Reduce takes key & list of assoc. values and combines
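The two bullets above can be sketched in a few lines of plain Python. This is a single-process illustration of the idea using a word count, not the Hadoop API; `map_fn`, `reduce_fn` and `map_reduce` are names invented for the sketch.

```python
from collections import defaultdict

def map_fn(key, value):
    """Map: take one key/value pair (line number, text) and emit key/value pairs."""
    for word in value.split():
        yield word.lower(), 1

def reduce_fn(key, values):
    """Reduce: take a key and the list of its associated values and combine them."""
    return key, sum(values)

def map_reduce(records, mapper, reducer):
    """Run the mapper over every record, group by output key, then reduce each group."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)
    return dict(reducer(k, vs) for k, vs in sorted(groups.items()))
```

A job like the PDF generation described earlier is the degenerate case: each map call does all the work for one input, and there is nothing left for a reduce step to combine.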
HDFS (Yahoo! inc)
Cool, scary. Supports mapreduce. Need to read up.
Pushmi for Subversion
Subversion replication. Probably not relevant to me with Git coming onto our scene.
Configuration management
Commercial products are archaic and monolithic
- Tivoli, CA Unicenter
- Help desk, ticketing, asset management, a lot of things badly.
Puppet, bcfg2, pikt, cfengine
Measuring success.
- Availability
- Reduction of incidents
- Rate of failed change
- Build time and speed to market
- Compliance to SOE, policy or standard
NewStuff:
- Puppetshow: a GUI for history and nodes
- iClassify (OS)
- Puppetview Logger (simple PHP app)
Subnode (VMs and containers)
- Zones look like resources
- inside, a zone looks like a node (but behaves differently)
Other stuff: (see BOF below)
- webrick – not for more than 25 hosts. 6000 hosts w/ Mongrel. Scale like any HTTPS service, w/ load balancer.
- puppet query against DB or LDAP.
Supervisor
Python. Built on medusa. Comparable: daemontools, launchd, runit
but built for control of customer processes, not to run as pid 1.
- supervisord forks arbitrary programs as its children (including apache, which you run in FG since it BGs itself).
- supervisorctl on same machine or over xmlrpc.
- ini config file
- captures child stdout/stderr, logs and rotates
- users cannot stop/start arbitrary processes, only predefined processes
- open or by authentication.
- xmlrpc for your own supervisorctl.
- single-process, thread. Your processes cannot block.
- event-notification system
- in any language
- can spawn fastcgi processes
- Can supervisor replace our bad sudo moderated scripts ??
Puppet BOF
- TKTK from U. Georgia Board of Regents
- TKTK from EngineYard
- Luke Kanies
- James Turnbull from Australian bank (today’s presenter)
- TKTK from ????
- Me
Puppet is in Git now.
Puppet purge noop reports on all aspects of the system that aren’t managed.
Ralsh takes purge report and generates manifests.
Scale by having one puppetmasterd dedicated to CA work and put the others behind LB as fileservers and manifest delivery.
Stanford’s configuration takes 30s to parse, about 1m to implement.
Svn pre-hook scripts can run a parse check. Git lets you run a pre-hook on the local devel repository and again on the core repository.
Puppet under pressure to use XMPP instead of HTTP.
New release reduces time by 70% by using YAML instead of Marshal.
Automateit Lightweight config management?
RALSH resource abstraction layer shell
FRIDAY
Keynotes
Mostly ho-hum. Microsoft’s opensource VP walked into the lion’s den and emerged mostly unscathed.
Sun’s TKTK on today’s languages. Reference to langpop.com language ranking based on.
O’Reilly book sales heat map. TIOBE index.
"I'd rather drive nails into my head" than code in PHP. (I didn't know they had over 5000 entries in their global namespace). Languages to watch: Groovy, OCaml (fast!), Erlang (stable, concurrent), Scala (see Twitter), Fan.
Terms to look up: “Monkey Patching”
Java isn’t going away because the JVM and APIs are so mature, so the environment is being extended to Ruby, Python, Groovy, etc.
NASA
Curt Tilmes at Goddard
Data: Growth 3.2 TB/day; Archive 4.9 PB; daily distribution processing 4.2 TB
MODIS
250m resolution
Level 1, raw
Level 2, gridded
Level 3, best 8 day w/o clouds if possible
…
http://www.opensource.org/licenses/nasa1.3.php
http://opensource.sfc.nasa.gov (Goddard specific, but other centers similar)
Curt gave a nice overview of opensource utilization at NASA and their attempts to contribute back. I would have liked more specifics on his struggle just to release a CPAN module. No specific technology takeaways, but I would like to see NIH/NCBI make a more determined effort to contribute on our periphery activities.
RT3/4
I tackled Jesse Vincent between sessions. Told him about my RT to JIRA project and my reluctance about it.
RT4 is coming out but is going to be a deep rewrite with Jifty, new DB schema, AJAXy and drag-drop workflow scripting generation. No release date set.
rt3.8 has an ‘extract header’ extension for linking into other ticket systems, which could be useful to me.
REST apis have been growing and are supported. Not XML based, can be used from command line.
Base level support is $5k. Gold is $35k. But is based on ticket numbers.
Open Source Electronics
Tom talked about using open source tools to develop your own open-source hardware projects. Recommended references:
- Art of Electronics
- OpAmps for Everyone
- MIT OpenCourseware
OpenSource CAD. Start with Paper. Software: gEDA/gaf + PCB (first one I picked). Should be easy.
gschem, then gsch2pcb, to get to a circuit board
PCB printing is clean, fast, cheap, credit card. www.4pcb.com, www.sunstone.com, Alberta printed fast and cheap, only two layer
Circuit simulation: SPICE. closed source freeware: LTSpice.
NGspice part of gEDA project.
Digital simulation: Verilog.
Failures
Therac massive radiation overdose. Removed hardware interlocks. Killed six people. Etc. with space program, electric utilities, and so on.
Main points:
- test
- test before launch
- test end to end
- listen to your sysadmins
Buzzwords
- open authentication
- xmpp
- jbod
- json
- erlang