The Boston Diaries' Journal
Below are the 11 most recent journal entries recorded in
The Boston Diaries' InsaneJournal:
| Tuesday, December 8th, 2009 | | 6:53 am |
The woodpeckers are coming
http://boston.conman.org/2009/12/08.1
If builders built buildings the way programmers wrote programs,
then the first woodpecker that came along would destroy
civilization.
Sad to say that's the first thing that came to mind at the end of
tonight's (or rather, this morning's) adventures.

Around midnight the Data Center In Boca Raton fell off the face of the
Internet. I caught it just as it happened (checking things out on the new
router I installed at a customer site some six hours earlier) and by the
time I left a voice mail message for our upstream and talked to Smirk (he
called as I was leaving the voice mail message), the Data Center In Boca
Raton was back on the Internet.

Shortly after that, while scanning the logs from snmptrapd
(I have all our routers sending SNMP traps to a central server), I got fed up with seeing
stuff like:
2009-12-06 06:28:08 XXXXXXXXXXXXXXXXXXXXXXXXX [XXXXXXXXXXXXXX]:
SNMPv2-MIB::sysUpTime.0 = Timeticks: (125022780) 14 days, 11:17:07.80
SNMPv2-MIB::snmpTrapOID.0 = OID: SNMPv2-SMI::mib-2.14.16.2.10
SNMPv2-SMI::mib-2.14.1.1 = IpAddress: XXXXXXXXXXXXXX
SNMPv2-SMI::mib-2.14.7.1.1 = IpAddress: XXXXXXXXXXXXXX
SNMPv2-SMI::mib-2.14.7.1.2 = INTEGER: 0
SNMPv2-SMI::mib-2.14.10.1.3 = IpAddress: XXXXXXXXXXXXXX
SNMPv2-SMI::mib-2.14.16.1.3 = INTEGER: 4
SNMPv2-SMI::mib-2.14.4.1.2 = INTEGER: 1
SNMPv2-SMI::mib-2.14.4.1.3 = IpAddress: XXXXXXXXXXXXXX
SNMPv2-SMI::mib-2.14.4.1.4 = IpAddress: XXXXXXXXXXXXXX
(only on one line). It makes it hard to figure out what the heck the
router is complaining about, so I wanted to change how the MIBs are formatted to make them easier to
read. I changed the command line options to snmptrapd only to
get:
/usr/sbin/snmptrapd: symbol lookup error: /usr/lib/libnetsnmpmibs.so.5:
undefined symbol: netsnmp_TCPIPv6Domain
Mind you, it took a good ten minutes of scratching my head over why
/etc/init.d/snmptrapd start wasn't working before I tried running it at
the command line.

All I know is that it was running fine a few minutes before, but not now. I
guess something changed in the 130 days since the server last rebooted (my guess:
a new version of snmptrapd without a corresponding new version
of some library; did I mention I hate package managers?). No problem, as I
had a locally installed copy in /usr/local/sbin/snmptrapd I
could use.

I rebooted the server (it's a virtual server, so it takes less than a minute),
which is when I noticed some odd issues with syslogd.

Okay, I'm not running the default syslog that comes with the
distribution. No, I've been testing a homegrown syslog (which
I will get around to talking about; it's quite cool) and it was basically
hanging when starting up (enough that some program called
minilogd was starting up, even though I have no XXXXXXX clue as to what is starting it; I can't find any
reference to it in the startup scripts).

Eventually, I figure out it's blocking on a DNS lookup (I'm relaying syslog traffic to a
centralized server, but that, as Alton Brown says, is another show),
which is odd, because DNS
hasn't been an issue.

I check, and I see I'm only using one of the two DNS resolvers we have. I can't resolve. I can ping the DNS server from the server I'm on. I can ssh to the DNS server from the server I'm on. I just can't resolve DNS
queries.

Now, the DNS resolver and
the server I'm on are both virtual servers. On the same physical computer. The other resolver? That's a virtual server on another physical computer and yes, I can
resolve fine using that (so I set the default DNS resolver to be the one that is working while I try to
troubleshoot the current issue that shouldn't be happening).

We used to have an issue with some virtual servers using that
virtual DNS resolver, but I
thought we had that licked months ago. Maybe it's back? I check iptables everywhere and no … should be fine.

A couple of hours go by. I've finally isolated the issue: the resolver itself can't
resolve. But the other one can.

It was then I noticed some odd messages being logged to
syslog and coming from our monitoring system:
HOST ALERT: XXXXXXXXXXXXXXX;DOWN;SOFT;14;
(No Information Returned From Host Check)
HOST ALERT: XXXXXXXXXXXXXXX;DOWN;SOFT;15;CRITICAL
- Host Unreachable (XXXXXXXXXXXXX)
HOST ALERT: XXXXXXXXXXXXXXX;DOWN;SOFT;16;CRITICAL
- Host Unreachable (XXXXXXXXXXXXX)
Hmm … our monitoring system in Charlotte can't reach our resolver …
okay, let's do a traceroute from Charlotte to the resolver
and …

OH XXXXX XXXXXXX XXXXX ON A XXXXXXX XXXX XXXXX! No wonder I'm having DNS issues: the netblock the resolvers
are on isn't being announced! WXXX TXX FXXX‽

That little outage around midnight? Apparently our upstream's upstream
had a slightly larger issue and couldn't route (what turned out to be) a few
of our netblocks. We do have multiple connections to the Internet, but …
well … it's a long story, but basically, just running BGP isn't enough. No, we have to
send authorization emails to have the other provider announce the routes
that normally go through the one that had (and was still having) issues.

Okay, so the problem(s) at hand. The fact that the netblock our DNS resolvers were on wasn't being
announced would explain why the one resolver couldn't even resolve using
itself; the other resolver probably had a larger working DNS cache and never had to send a
query.

I swear, the number of “moving parts” a modern networked computer has
to deal with is amazing, and it's amazing it works at all as well as it
does, when it does. But man, when it breaks, it breaks and it's a
bitch to troubleshoot (especially when you're doing it remotely; why even
suspect the network in such a case?).
| | Friday, December 4th, 2009 | | 6:37 pm |
Has it really been that long already?
http://boston.conman.org/2009/12/04.1
Today marks the 10th year
of this blog. Not many bloggers have reached this milestone (how many
bloggers do you know who started in 1999? I thought so).

I may not have posted consistently, but I'm still chugging away here 10
years later.
| | Thursday, December 3rd, 2009 | | 4:02 am |
This actually doesn't sound half bad …
http://boston.conman.org/2009/12/02.2
It's no secret that we've been openly critical of the prices
charged by automakers for built-in GPS navigation systems. Frankly,
paying $2,000 or more for an in-dash system when you can buy
stand-alone navigation units for as little as $100 is ridiculous.
Even the newer, larger seven-inch screen units are now down to as
little as $250, and even though they aren't tied in to a vehicle's
wheel sensors, they tend to be plenty accurate. Now, however, there
is a new option that is even cheaper – as in (sort of) free.

It's only "sort of" free because the Google maps turn-by-turn
navigation app is built into the new Motorola
Droid smartphone (see sister-site Engadget's full review of the
Droid here)
that recently became available from Verizon Wireless. In this case,
you have to sign up for two years of mobile phone service, which
includes a data plan. I've been a Verizon customer for a decade and
just happened to be up for a biennial discounted phone upgrade. When
the Droid appeared a few weeks ago, the plan to wait until the new
year for a Palm Pre was discarded. We've now had the chance to play
with the Droid and its new navigation software, so follow
the jump to find out if it lives up to expectations.
Via Instapundit,
Review:
Google Maps turn-by-turn navigation on Android 2.0 —
Autoblog
For Corsair, who loves his
gadgets, and because he hates
AT&T.

Personally, I wouldn't mind this. I think the combination of Google Maps and a
GPS is wonderful,
although I wouldn't use the turn-by-turn navigation (map view is fine by me,
with spot checks with the street view to avoid potentially bad areas). I
would also like a larger screen, but hey, you can't have everything.
| | Wednesday, December 2nd, 2009 | | 8:17 pm |
I told you handling errors was error prone
http://boston.conman.org/2009/12/02.1
I find it even more amusing that you didn't get the error
handling right in the create_socket() on your current
blog post. Notice that you leak the socket and/or memory in the error cases.
I guess it really is hard to handle errors. ;-) Sorry, I just had to take this cheap shot! -MYG
Heh. Yup, I blew it again for demonstration purposes. The code I posted yesterday was
actually pulled from a current project where the
create_socket() is only called during initialization and if it
fails, the program exits. Since I'm on a Unix system, the “lost”
resources like memory and sockets are automatically reclaimed when the process exits. Not all
operating systems are nice like this.

There are a few ways to fix this. One is to use a language that handles
such details automatically with garbage
collection, but I'm using C so that's not an option. The second one is
to add cleanup code at each point we exit, but using that we end up with
code that looks like:
/* ... */
if ((rc = fcntl(listen->sock,F_GETFL,0)) == -1)
{
  perror("fcntl(GETFL)");
  close(listen->sock);   /* release the socket first ... */
  free(listen);          /* ... then the memory that holds it */
  return -1;
}
if (fcntl(listen->sock,F_SETFL,rc | O_NONBLOCK) == -1)
{
  perror("fcntl(SETFL)");
  close(listen->sock);
  free(listen);
  return -1;
}
if (bind(listen->sock,paddr,saddr) == -1)
{
  perror("bind()");
  close(listen->sock);
  free(listen);
  return -1;
}
/* ... */
That's a lot of duplicated code, and the more complex the routine, the more
complex the cleanup and the greater the potential to leak memory (or other resources like
files and network connections).

The other option looks like:
  /* ... */
  if ((rc = fcntl(listen->sock,F_GETFL,0)) == -1)
  {
    perror("fcntl(GETFL)");
    goto create_socket_cleanup;
  }
  if (fcntl(listen->sock,F_SETFL,rc | O_NONBLOCK) == -1)
  {
    perror("fcntl(SETFL)");
    goto create_socket_cleanup;
  }
  if (bind(listen->sock,paddr,saddr) == -1)
  {
    perror("bind()");
    goto create_socket_cleanup;
  }
  /* rest of code */
  return listen->sock; /* everything is okay */
create_socket_cleanup:
  close(listen->sock);   /* falls through to free the memory as well */
create_socket_cleanup_mem:
  free(listen);
  return -1;
}
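For reference, here is a minimal, self-contained sketch of the same cleanup-label pattern that compiles on its own. It is not the code from the project: struct listener, listener_create() and listener_destroy() are hypothetical names made up for illustration, and the epoll registration is left out to keep it short. The labels release resources in the reverse of the order they were acquired, and each failure jumps to the label matching what exists at that point.

```c
#include <assert.h>
#include <fcntl.h>
#include <netinet/in.h>   /* struct sockaddr_in, for callers building an address */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical stand-in for the post's listen node, not the real structure. */
struct listener
{
  int sock;
};

/* Returns a new non-blocking UDP listener, or NULL with nothing leaked. */
struct listener *listener_create(const struct sockaddr *paddr,socklen_t saddr)
{
  struct listener *node;
  int              rc;

  assert(paddr != NULL);
  assert(saddr >  0);

  node = malloc(sizeof(struct listener));
  if (node == NULL)
    return NULL;                    /* nothing acquired yet */
  memset(node,0,sizeof(struct listener));

  node->sock = socket(paddr->sa_family,SOCK_DGRAM,0);
  if (node->sock == -1)
  {
    perror("socket()");
    goto listener_cleanup_mem;      /* only the memory exists so far */
  }

  rc = fcntl(node->sock,F_GETFL,0);
  if (rc == -1)
  {
    perror("fcntl(GETFL)");
    goto listener_cleanup_sock;
  }

  if (fcntl(node->sock,F_SETFL,rc | O_NONBLOCK) == -1)
  {
    perror("fcntl(SETFL)");
    goto listener_cleanup_sock;
  }

  if (bind(node->sock,paddr,saddr) == -1)
  {
    perror("bind()");
    goto listener_cleanup_sock;
  }

  return node;                      /* everything is okay */

listener_cleanup_sock:              /* acquired last, released first */
  close(node->sock);
listener_cleanup_mem:
  free(node);
  return NULL;
}

/* Normal teardown releases the same resources in the same reverse order. */
void listener_destroy(struct listener *node)
{
  if (node != NULL)
  {
    close(node->sock);
    free(node);
  }
}
```

Returning the node (or NULL) instead of the bare socket is a choice made for this sketch so the caller has something to hand back to listener_destroy().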
This uses the dreaded goto construct, but this is one of the few
places where it's deemed “okay” to use goto: cleaning up after
errors. No code duplication, but you need to make sure you clean up (or
unwind, or whatever) in reverse order.

So yeah, error handling … maddening. I still wish there was a better way …
| | Tuesday, December 1st, 2009 | | 7:58 pm |
Programs are buggy because error checking is tedious and error prone. Ironic, don't you think?
http://boston.conman.org/2009/12/01.2
“There are no easy answers when it comes to reporting and
handling errors.” P. J. Plauger
I release a new version of the
greylist daemon and what happens? Mark finds a bug. Or rather, the new version didn't fix his current problem.

I take a look into the issue and it's not terribly surprising that I
botched handling a particular error. Sigh.

Error handling was taught by osmosis at college. It was expected that we just
magically pick up on handling errors but really, as long as the program
didn't outright crash and produced something vaguely like the expected output, all
was fine. In fact, when I expressly asked one of my instructors what to do when
handling a particularly thorny error, I was told, point blank: “if you don't
know how to handle the error, then don't check for it.” The instructor,
sadly, had a point: you could go mad from trying to handle every
possible error condition.

It's not hard to test for errors. In C, just about every function you
can think of can return an error (with a few exceptions, like
getpid() under Unix; if a process can't get its own process
ID, then you have other things to worry about). But checking the return of
every function call gets tedious, fast. What was once a small
concise function:
int create_socket(struct sockaddr *paddr,socklen_t saddr)
{
  ListenNode         listen;
  struct epoll_event ev;
  int                reuse = 1;
  int                rc;
  assert(paddr != NULL);
  assert(saddr >  0);
  listen = malloc(sizeof(struct listen_node));
  memset(listen,0,sizeof(struct listen_node));
  memcpy(&listen->local,paddr,saddr);
  listen->fn   = event_read;
  listen->sock = socket(paddr->sa_family,SOCK_DGRAM,0);
  setsockopt(listen->sock,SOL_SOCKET,SO_REUSEADDR,&reuse,sizeof(reuse));
  rc = fcntl(listen->sock,F_GETFL,0);            /* fetch the current flags */
  fcntl(listen->sock,F_SETFL,rc | O_NONBLOCK);   /* and add non-blocking */
  bind(listen->sock,paddr,saddr);
  memset(&ev,0,sizeof(ev));
  ev.events   = EPOLLIN;
  ev.data.ptr = listen;
  epoll_ctl(g_queue,EPOLL_CTL_ADD,listen->sock,&ev);
  return listen->sock;
}
becomes twice as big as each function call is wrapped up in an
if statement:
int create_socket(struct sockaddr *paddr,socklen_t saddr)
{
  ListenNode         listen;
  struct epoll_event ev;
  int                reuse = 1;
  int                rc;
  assert(paddr != NULL);
  assert(saddr >  0);
  listen = malloc(sizeof(struct listen_node));
  if (listen == NULL)
    return -1;
  memset(listen,0,sizeof(struct listen_node));
  memcpy(&listen->local,paddr,saddr);
  listen->fn   = event_read;
  listen->sock = socket(paddr->sa_family,SOCK_DGRAM,0);
  if (listen->sock == -1)
  {
    perror("socket()");
    return -1;   /* listen is leaked here; see the December 2nd follow-up */
  }
  if (setsockopt(listen->sock,SOL_SOCKET,SO_REUSEADDR,&reuse,sizeof(reuse)) == -1)
  {
    perror("setsockopt()");
    return -1;
  }
  if ((rc = fcntl(listen->sock,F_GETFL,0)) == -1)
  {
    perror("fcntl(GETFL)");
    return -1;
  }
  if (fcntl(listen->sock,F_SETFL,rc | O_NONBLOCK) == -1)
  {
    perror("fcntl(SETFL)");
    return -1;
  }
  if (bind(listen->sock,paddr,saddr) == -1)
  {
    perror("bind()");
    return -1;
  }
  memset(&ev,0,sizeof(ev));
  ev.events   = EPOLLIN;
  ev.data.ptr = listen;
  if (epoll_ctl(g_queue,EPOLL_CTL_ADD,listen->sock,&ev) == -1)
  {
    perror("epoll_ctl(ADD)");
    return -1;
  }
  return listen->sock;
}
And even this isn't all that great. The function prints what happened
(via the perror() call) but that presupposes that
stderr (the standard error reporting file) is open! If it's
not open, well then … if you're lucky, perror() (or whatever
code it eventually calls) checks to see if stderr is open and
if not, just … fail … gracefully? I guess? Hopefully? Maybe?

But even if we could return what failed to the caller, then that means
the implementation details get pushed out of the
create_socket() function (which is typically considered a “bad
idea”). Even setting that aside, what can the caller do? Well, not much
really. The socket() call could fail because there's not enough
kernel memory, or there are too many files already open, or we don't have
permission to create the socket, or the protocol isn't supported. Don't
have the right privileges? If the process isn't root, there's not much that
can be done (and if the process was root, there wouldn't be an issue). Not
enough kernel memory? I wouldn't even know what to do in that case
except kill off a few processes or reboot the box. But in each case, there
isn't much the program can do except give up (and maybe attempt to log the
error somewhere).

And all the other calls are pretty much the same: no memory, no
privileges, can't do it, etc. We can break the errors down into a few
categories:
- programming errors, like
EBADF (not an open file,
or the operation couldn't be done given how the file was opened
originally) or EINVAL (invalid parameter) that need to
be fixed, but once fixed, never happen again;
- errors that can be fixed, like
EACCES (bad privileges) or
ELOOP (too many symbolic links when trying to resolve a
filename) but where the fix has to happen outside the scope of the
program; once fixed, they tend not to happen again unless someone made
a mistake;
- errors where the program had better exit as quickly and cleanly as possible
because something bad, like
ENOMEM (insufficient kernel
memory) just happened and things are going bad quickly. Depending
upon the circumstances, a fast, hard crash might be the best thing
to do;
- and finally, the small category of errors that a program
might be able to handle, like
ENOENT (file
doesn't exist) depending upon the context (it could then create the
file, or ask the user for a different file, etc.).
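As a rough illustration of those four categories, here is a sketch that sorts a handful of errno values into them. The enum and classify_errno() are hypothetical names invented for this example, ENOMEM is assumed to be the value meant for the "insufficient kernel memory" case, and the mapping covers only the errors named in the list; which category a given error really belongs to depends heavily on the call that returned it.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical categories, mirroring the list above. */
enum error_class
{
  ERRCLASS_PROGRAMMING,   /* fix the code; should never happen again */
  ERRCLASS_ENVIRONMENT,   /* fixable, but only outside the program */
  ERRCLASS_FATAL,         /* get out as quickly and cleanly as possible */
  ERRCLASS_RECOVERABLE,   /* the program might be able to handle it */
  ERRCLASS_UNKNOWN        /* not one of the examples in the list */
};

/* Map an errno value to one of the categories (illustrative only). */
enum error_class classify_errno(int err)
{
  switch(err)
  {
    case EBADF:
    case EINVAL:
         return ERRCLASS_PROGRAMMING;
    case EACCES:
    case ELOOP:
         return ERRCLASS_ENVIRONMENT;
    case ENOMEM:
         return ERRCLASS_FATAL;
    case ENOENT:
         return ERRCLASS_RECOVERABLE;
    default:
         return ERRCLASS_UNKNOWN;
  }
}
```

A caller could switch on the result right after a failed call, for example retrying with a different filename only when the class is ERRCLASS_RECOVERABLE.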
The problem being: a program can run for ages before you see an error
(Mark ran my greylist
daemon for a few years before the error manifested itself, and only then
because the server operating system was upgraded; the bug he hit was of the
last category, something it could have handled, but I didn't handle it
properly) so it's not uncommon for error paths in programs to have, well,
errors (ironically enough).

In fact, there are really only three ways to handle errors:
- every subroutine returns an indication of success or failure (C
uses this, the “calls filter downward, errors bubble upwards”
model) and every call site needs a check;
- subroutines can cause an exception, which immediately transfers
program flow control to some earlier caller up the call stack;
which caller gets the exception depends upon which caller is
expecting which exception (C++ uses this, the “dynamic
spaghettiesque come-from” model);
- ignore errors entirely and assume everything will always work
(you can do this in any programming language; it's from
the Alfred
E. Neuman “What? Me, worry?” school of programming).
Each method has its pros and cons and nobody is really happy with any of
the methods, but really, that's all there is when you get down to it. The
first is tedious, but doesn't require any special language features; the
second requires support in the language, and even so, is really spaghetti code in
hiding; and the third … well, again, no special language support is needed,
it isn't tedious and it tends to make the code fast but, well … let's just
say that the error recovery can make a programmer go postal.

I just wish there was a better way …
Update on Wednesday, December 2nd, 2009
I could claim I left finding the errors in the code above as an exercise for the reader, but really, I blew it.
| | 3:00 am |
| | Sunday, November 29th, 2009 | | 8:26 pm |
How to run Firefox 3.5 under CentOS 4.4
http://boston.conman.org/2009/11/29.1
I finally received a Google Wave Invite (via Smirk) and
decided to give it a try.

I go to the link, sign in with my Google email address, and get
the following:

I'm using Firefox 2.0.0.20. Yes, it's an
older version, but hey, it works, so why fix it?
So I download the latest Firefox (3.5) and try the website again:
[spc]lucy:~/bin/firefox>./firefox
./firefox-bin: error while loading shared libraries:
libpangocairo-1.0.so.0: cannot open shared object file:
No such file or directory
[spc]lucy:~/bin/firefox>
Ah yes, that's what happened the last time I tried running Firefox 3.

I'm using CentOS
4.4 as my desktop, and yes, I'm using an older release of a
distribution geared towards servers as a desktop. As for the older
release part, hey, it works, and I dislike upgrading if I don't have to
(I used RedHat
5.2 (not to be confused with their latest
5.2 offering) for about ten years prior to my upgrade to CentOS 4.4).
As for the server bit, well … I do a lot of server development, and we run
CentOS at The Office™ so it makes my life easier. And I'm used to RedHat/CentOS.

So anyway, back to the current issue: Firefox 3.5. The default repositories for CentOS 4.4 don't carry
libpangocairo and after some searching, I found that by installing
frysk (ah, so
that's what frysk does) I get
libpangocairo as part of the package, under
/usr/lib/frysk. So, all I need to do is tell Firefox 3.5 where
to load that library from and I'm good to go.
[spc]lucy:~/bin/firefox>LD_LIBRARY_PATH=/usr/lib/frysk/ ./firefox
./firefox-bin: error while loading shared libraries:
libdbus-glib-1.so.2: cannot open shared object file:
No such file or directory
[spc]lucy:~/bin/firefox>
Okay … hmm … I do have a libdbus-glib-1.so.0.0.0—what
happens if I symbolically link libdbus-glib-1.so.2 to it?
[spc]lucy:~/bin/firefox>LD_LIBRARY_PATH=/usr/lib/frysk/ ./firefox
./firefox-bin: error while loading shared libraries:
libdbus-1.so.3: cannot open shared object file:
No such file or directory
[spc]lucy:~/bin/firefox>
Okay, let's symbolically link libdbus-1.so.3 to
libdbus-1.so.0.0.0 and see what happens … woot! Success! I'm
running Firefox 3.5!

And now I can try out that Google Wave Thang everybody is so hyped
about.
| | Thursday, November 26th, 2009 | | 6:14 pm |
Notes from a post-Thanksgiving Dinner
http://boston.conman.org/2009/11/26.1
Plates are everywhere. Piles of food still clutter the kitchen. Bunny, her mom and I are slowly waddling from the dining room to the family room to veg out for the next few hours.
“I hope ‘Wheel of Fortune’ is still on,” said Bunny. Her mom loves the show.
“Why wouldn't it be?” I asked.
“There could be a football game on,” she said.
“Oh yes, the Detroit Lions and some other team always play on Thanksgiving, don't they?”
“Yes.”
“Oh, and there's also ‘A Charlie Brown Thanksgiving’ they could be showing.”
Bunny had the TV on and was flipping through the online guide. “Oh good,” she said. “‘Wheel of Fortune’ is on at 7:00 pm.”
“And I see ‘Jeopardy’ is on next.”
“And oh, look! It's ‘A Charlie Brown Thanksgiving!’”
“Wait a second,” I said. “Scroll back a bit … Spanish? It's in Spanish?”
“No, there's an alternative audio track in Spanish.”
“Ah. So what's ‘Muah muah muah muah muah muah muah’ in Spanish?”
| | Tuesday, November 24th, 2009 | | 7:05 pm |
| | 8:13 am |
Our Modern World
http://boston.conman.org/2009/11/24.1
Let me get this out right up front: this is a rant. Don't expect any
rational thought here.

Anyway … at The Company™ we have a particular network issue.
It's critical, but it isn't “customer screaming on the phone to get it back
up yesterday” critical (although it's getting close to that).

We've ruled out The Monopolistic Phone Company as the source of the
problem. That particular part of the network circuit is fine. In fact,
we've isolated the problem to the network connection between two cabinets in a
data center (not The Data Center at Boca Raton; this one is in another
city). The problem I'm ranting about is that this particular run of cable
between two cabinets involves seven companies (including us) that need to be co-ordinated to fix this particular issue.

Aaaaarg!

Smirk started the ball rolling on this yesterday morning. Twelve hours
later, he got a bit further, then a bit further last night. He then told me
to expect a call sometime late.

It came at 6:30 am.

Grrrrrrrrrrr.

From my vantage point, the problem is that we don't have a
straightforward network connection, since it comes through one cabinet
somewhere in the data center to another cabinet somewhere else in the data
center, which apparently isn't a common occurrence in this particular data
center. Even rarer, our connection uses VLANs, which moves us from the “rare” column to the “what
the heck is that?” column (I suspect that the intermediary switches the
data center is using to hook the two cabinets together aren't configured for
VLAN traffic, but I
won't know until we get everybody together onto the Conference Call From
Hell sometime in the next few hours or so … ).

A part of me wants to blame outsourcing, but seeing how we're one of the
parties being outsourced to (providing the Internet connections and
some specialized routing), I really shouldn't be complaining all that much.
But the sheer number of parties involved is expressly due to a
whole bunch of outsourcing by everybody involved. Now, I
understand the arguments for outsourcing (concentrate on your core
competency and hire other companies to handle the other stuff that's needed
but is outside the scope of your company), but seven companies? For
what appears to be a misconfigured switch? |
|