2012

Monday, November 5, 2012

Testing Network And TCP Optimizations

This post is more like a "note to self" for certain TCP parameters which I usually modify (or plan to modify) on production servers.

Some good to know terms:

Round Trip Time (RTT): It is the time taken by a packet from source machine to reach destination and come back. You can use ICMP ping to get the RTT.
Latency: The time from the source sending a packet to the destination receiving it. This is often mixed with RTT. Clarify what you are talking about before interpreting anything.
Bandwidth Delay Product (BDP): It is the amount of data that can be in transit in the network or simply the product of link bandwidth and RTT.
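As a quick worked example of the BDP definition above, here is a small sketch in shell arithmetic. The 100 Mbit/s bandwidth and 100 ms RTT are assumed values picked for illustration, not measurements from any real link.

```shell
#!/bin/sh
# Assumed example values; substitute your own link speed and measured RTT.
bandwidth_mbit=100   # link bandwidth in megabits per second
rtt_ms=100           # round trip time in milliseconds

# bits in flight = bandwidth (bit/s) * RTT (s).
# 1 Mbit/s for 1 ms carries 1000 bits, hence the factor of 1000.
bdp_bits=$(( bandwidth_mbit * 1000 * rtt_ms ))
bdp_bytes=$(( bdp_bits / 8 ))

echo "BDP: ${bdp_bits} bits (${bdp_bytes} bytes)"
```

With these numbers roughly 1.25 MB can be in flight at once, which gives a feel for how large a TCP buffer has to be to keep such a link busy.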

Say you want to test your app or benchmark hardware you just bought. The first thing you need to do is to add it to a network; even a local network will do. Please avoid wireless networks because RTT varies a lot on them and it becomes difficult to see if the hardware is at fault or the wireless link.

Adding Latency Or RTT Delay To The Network
If you are serious about testing hardware then you may need to test at various RTT/latency values to evaluate the experience of your customers from various locations across the world. To introduce this RTT delay you can use Network Emulator or simply netem and fire the following command:

tc qdisc add dev eth0 root netem delay 100ms

The command above delays every packet leaving the eth0 interface by 100ms, which adds 100ms to the RTT seen through that interface. Now you can play around with it to check various values of RTT delay. When you are done, remove the delay by deleting the rule.

tc qdisc del dev eth0 root

An awesome tutorial on netem can be found at LinuxFoundation.org. The netem mailing list archives might also help with debugging.

Server Setup For Testing
If you plan to test the hardware then I suggest running a simple, no-frills HTTP server on it, like Python's single threaded server. Using scp for testing is not a good idea since OpenSSH has its own application-level flow control, which can limit throughput independently of TCP. To run the Python single threaded server (on Python 3 the module is called http.server instead), fire the following commands on your terminal:

cd <doc_root_of_http_server>

python -m SimpleHTTPServer

Make sure a large file is present in the document root of the server and that curl or wget is present on the client. Do not use any browser or download manager to download from server.
Of course if you are testing your app then the above might not be applicable to you. In that case set up the server and client depending upon your app.

Testing and Recording The Defaults
Recording defaults is important in case you need to revert anything. A full backup can be obtained easily by sysctl command:
sysctl -A > sysctl.bak

Now download the file from the server without introducing any latency. This is the default performance at 0ms added latency. Now let us start the serious testing and introduce latency. Add an RTT of 100ms, download the file using curl or wget and see the speed.
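The test procedure above can be sketched as a small dry-run script that only prints the commands for each delay value, so it can be reviewed before anything is run as root. The interface name, the URL and the list of delays are placeholders chosen for this example.

```shell
#!/bin/sh
# Dry run: print the commands for one test per delay value.
# IFACE, URL and DELAYS are placeholders, not values from a real setup.
IFACE=eth0
URL=http://server.example/largefile
DELAYS="0 50 100 200"

plan_run() {
    # $1 = added delay in milliseconds
    if [ "$1" -gt 0 ]; then
        printf '%s\n' "tc qdisc add dev $IFACE root netem delay ${1}ms"
    fi
    # curl discards the body and reports the average download speed
    printf '%s\n' "curl -s -o /dev/null -w 'speed: %{speed_download} bytes/s' $URL"
    if [ "$1" -gt 0 ]; then
        printf '%s\n' "tc qdisc del dev $IFACE root"
    fi
}

for d in $DELAYS; do
    printf '%s\n' "# --- ${d}ms added delay ---"
    plan_run "$d"
done
```

Printing instead of executing is a deliberate choice here: the tc commands need root and change live network behaviour, so it is safer to paste them in by hand on a test box.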

Various TCP Optimizations and Parameters To Check
I just found out during this experiment that new kernels have great settings for TCP, still cross checking won't hurt.

First and foremost get acquainted with /proc on your machine, specifically the /proc/sys/net/ directory. I would also encourage you to go through the man page of tcp and understand the parameters.

The changes I am going to suggest depend heavily on kernel version and the distribution. If not done correctly, these changes can degrade networking performance or may harm your machine in any other way. You have been warned. You are on your own.

First of all we'll examine if TCP selective ack is turned on or not and turn it on if it is off. It is boolean so just set the right value to 1 and you are good to go:
sysctl -w net.ipv4.tcp_sack=1
We need to make sure that TCP window can scale to utilize maximum buffer possible:
sysctl -w net.ipv4.tcp_window_scaling=1
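Set the read and write buffers for TCP to suitable values. Each setting is an array of 3 values which defines the minimum, default and maximum amount of memory a TCP socket buffer can use. Also note that the default value here overrides, for TCP sockets, the generic default defined in rmem_default and wmem_default, while the generic maximums in rmem_max and wmem_max still cap buffers that an application requests explicitly. The generic (non-tcp) values live in the following files:

/proc/sys/net/core/rmem_max
/proc/sys/net/core/wmem_max
/proc/sys/net/core/rmem_default
/proc/sys/net/core/wmem_default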

Setting this is usually heuristic and depends largely on your network. Also with auto scaling on, it can scale up to the maximum value defined. Set it up by using the following command:
sysctl -w net.ipv4.tcp_rmem='4096 87380 4194304'
sysctl -w net.ipv4.tcp_wmem='4096 16384 4194304'
Here the default memory allocated to the receive buffer for each TCP connection would be 87380 bytes and it can scale up to 4194304 bytes depending upon the connection. I suggest that you experiment with the values a bit to find the best combination.
If you are doing non-tcp optimizations as well then set net.core.rmem_max, net.core.wmem_max, net.core.rmem_default, net.core.wmem_default as well to similar values.
Enable TCP TIME_WAIT reuse. This allows sockets in the TIME_WAIT state to be reused for new outgoing connections when it is safe from the protocol's point of view. This generally increases performance if your machine is going to make a lot of short lived outgoing connections.
sysctl -w net.ipv4.tcp_tw_reuse=1
The rate at which new outgoing connections can be opened to a single destination can sometimes play a role in servers handling high traffic. A rough estimate is obtained by dividing the difference of the values in the file /proc/sys/net/ipv4/ip_local_port_range by the value in /proc/sys/net/ipv4/tcp_fin_timeout. For my system it is (61000-32768)/60 which turns out to be about 470 per second. You can increase the range of the ports or you can reduce the tcp_fin_timeout but experiment first before deploying in production.
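The division above can be reproduced with shell arithmetic. This is only a sketch that hard-codes the numbers quoted in the post; on a live machine you would read them from the two /proc files instead.

```shell
#!/bin/sh
# Values quoted in the post. On a real box they come from
# /proc/sys/net/ipv4/ip_local_port_range and /proc/sys/net/ipv4/tcp_fin_timeout.
port_low=32768
port_high=61000
fin_timeout=60

# Integer division, so the fractional part is dropped.
estimate=$(( (port_high - port_low) / fin_timeout ))
echo "estimate: ${estimate}"
```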

There are a lot of other parameters that can be tweaked for higher performance. You can try all of them out but do not march straight into production servers with these tweaks. Experiment in your staging boxes first.

Wednesday, October 24, 2012

Common Linux Termination Signals

People issuing a kill -9 on production servers blindly should read this:

If you have ever used Linux terminal, I think you should know what Linux Termination Signals are. They are the signals that you usually supply with kill command when you are trying to kill a process. Depending upon the provided signal further actions can be taken.

SIGTERM: It is the default signal used by kill command. This signal can be handled by the process in the sense that process can choose to ignore this or do some specific action upon receiving this signal. It is like politely asking someone to do something.
SIGKILL: This is the signal sent when you supply -9 as argument to the kill command. This signal cannot be handled or ignored. Process has to die as a result of this signal. Usually this is the last resort thing. If there is a process which is not dying even after receiving a SIGKILL then you should report it to upstream. It is most likely some sort of bug in distribution or the app itself.
SIGHUP: This signal tells the process that the user's terminal got disconnected somehow. Usually this happens when your ssh connection breaks or you face network connectivity issues etc.

You can get a list of available signals for your distribution by firing kill -l on terminal. For my Fedora 17 box there are 64 signals available. I think the number would be the same for the distributions following POSIX standard.

UPDATE: (As mentioned by @MarcusMoeller on twitter) SIGHUP is commonly used by daemons to reload config files.
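To illustrate what "handling" SIGTERM means, here is a minimal sketch of a shell worker that traps the signal, cleans up and exits on its own terms. The worker function and its message are made up for the example.

```shell
#!/bin/sh
# A background worker that handles SIGTERM instead of dying abruptly.
worker() {
    trap 'echo "got SIGTERM, cleaning up"; exit 0' TERM
    while :; do sleep 1; done
}

worker &            # start the worker in the background
pid=$!
sleep 1             # give it a moment to install the trap
kill "$pid"         # plain kill sends SIGTERM, the polite request
wait "$pid"         # the trap runs once the current sleep finishes
status=$?
echo "worker exit status: $status"
```

Replacing kill "$pid" with kill -9 "$pid" would skip the trap entirely, which is why cleanup code never runs after a SIGKILL.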

Daemontools: To Ensure That Your Processes Are Always Running

In production environment, it is often required to restart the process if it dies. Not doing so fast enough may result in user facing downtime which can impact the business, hence automation of this job is critical to reduce response time.

There are several tools which do similar jobs like Daemontools, Supervisor, bash scripts, Monit etc. We are going to talk about Daemontools since it is very light, implements basic feature set and allows you to do more over its standard commands. It is written by D. J. Bernstein, the creator of djbdns and several awesome tools.

Daemontools runs the processes as its own child process and spawns a new one as soon as the older dies. You have to ensure that the process you are trying to run using daemontools runs in foreground and is not a daemon in itself. The svscan program of Daemontools will scan the present working directory and its subdirectories and launch one supervise process for each run script found.

Installation:
For Fedora, Red Hat and other Enterprise Linux you can use untroubled.org's daemontools rpm. Daemontools itself was written 11 years ago (and it still works like a charm!) so don't worry about the stale rpms.
For Ubuntu server and similar distributions, you can install the daemontools and daemontools-run packages and you would be good to go.

In case you want to do a manual install, instructions can be obtained from the Daemontools website.

Usage:

Once the daemontools is installed, all you need to do is create an executable script named run and put the command to launch the process in the script. Start the svscan tools after this and let the magic happen.

Let us see this through an example. Say you have 2 services, nginx and php-fpm, which you want to restart immediately if they die. Here is what your directory structure should look like:

Daemontools service dir (usually /etc/service)
├── nginx
│   └── run
└── php-fpm
    └── run

run file should have a shebang like #!/bin/bash or #!/usr/bin/python defined and it should have executable bit set. Once you are done with this, start svscan either manually by running svscan command or if you have a SysV init script then use that.
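A run script for the nginx service could look roughly like the sketch below. It builds the layout in a scratch directory rather than in /etc/service so it is safe to try, and it assumes nginx is on the PATH of the machine where the run script will eventually execute.

```shell
#!/bin/sh
# Build the layout in a scratch directory instead of /etc/service.
SERVICE_DIR=$(mktemp -d)
mkdir -p "$SERVICE_DIR/nginx"

# 'exec' replaces the shell with nginx so supervise watches nginx itself;
# 'daemon off;' keeps nginx in the foreground, as daemontools requires.
cat > "$SERVICE_DIR/nginx/run" <<'EOF'
#!/bin/sh
exec nginx -g 'daemon off;'
EOF

chmod +x "$SERVICE_DIR/nginx/run"
ls -l "$SERVICE_DIR/nginx/run"
```

Once such a directory sits under the real service directory, svstat /etc/service/nginx reports its state and svc -t /etc/service/nginx sends it a TERM so that supervise restarts it.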

I also suggest you to go through "man svc" to get an idea of commonly used commands of Daemontools.

Monday, September 3, 2012

Grep And Sed Substitution Commands To The Rescue

Since there are a bunch of freshers joining in as sysads in my office, I think it is an appropriate time to write this post about two very basic yet powerful Linux utilities.

Grep: It searches for a given string in the text provided. Along with the pipe character ('|'), I think, it would be the most used command on Linux maybe second to "ls". It'll display the lines containing the strings specified by default. Let us see how the options can be varied to obtain the desired output.

General Syntax:
grep <string> <filename>
You can also do:
cat <filename> | grep <string>

Some helpful flags and options:

Grep to search recursively:
grep -r 'search this' /abc
This flag will make grep go through all the files in the directory /abc and look for the string "search this".
Grep to ignore case:
grep -i "search this" abc.txt
This flag will make grep match "search this" without regard to case. So it'll match "Search THis" or "seaRCh thIs" or "search this" and so on.
Match whole word or whole line:
grep -w "search this" abc.txt
grep -x "search this whole line" abc.txt
The first command will match whole words only. It will not match "search thistoo" or "whysearch this". Similarly the second one will match a whole line and not a substring.
Invert match or not matching the given string
grep -v "search this" abc.txt
If you want to have all the lines which do not contain "search this" then use the command above.
Print line number with matches:
grep -n "search this" abc.txt
Print the name of the files containing or not containing the matches:
grep -l "search this" /abc/*
grep -L "search this" /abc/*
The first command will print the names of all the files which contain the string "search this" while the second one will print the names of all the files in which "search this" does not appear.
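The flags above can be tried on a throwaway file. The sketch below creates one with made-up contents and counts the matches for each flag using -c, so the differences are easy to compare.

```shell
#!/bin/sh
# Hypothetical sample file, created only for this demonstration.
sample=$(mktemp)
cat > "$sample" <<'EOF'
search this
Search THis too
whysearch this
nothing here
EOF

plain=$(grep -c "search this" "$sample")     # lines containing the string
nocase=$(grep -ci "search this" "$sample")   # same, ignoring case
word=$(grep -cw "search this" "$sample")     # whole-word matches only
line=$(grep -cx "search this" "$sample")     # whole-line matches only
invert=$(grep -cv "search this" "$sample")   # lines NOT containing the string

echo "plain=$plain nocase=$nocase word=$word line=$line invert=$invert"
grep -n "search this" "$sample"              # matches with line numbers
rm -f "$sample"
```

Note how -w drops "whysearch this" because the match is glued to another word, while the plain search still counts it.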

Sed: Also known as Stream Editor. It is a very powerful editor to work on a collection of files simultaneously. Let us learn some basics of sed.

General Syntax:

sed <options> <input>

Some helpful flags and options:

sed 's/old/new/' <original.txt
In the above command s stands for substitute. In this the word "old" will be replaced by word "new" but keep in mind that it won't be written to original.txt. It'll be just displayed on stdout.
sed doesn't mandate the use of '/' as a delimiter. Practically you can use almost anything as the delimiter. For example:
$ echo "aditya's blog" | sed 's|aditya|aditya patawari|'
$ echo "aditya's blog" | sed 'sXadityaXaditya patawariX'
will give the same output which is:
aditya patawari's blog
The commands above will replace only the first occurrence of the word in a line. To substitute a word globally in a text pass 'g' flag like below:
$ cat test.txt | sed 's/aditya/aditya patawari/g'
Now if you want to write the changed lines to a file, you can use the 'w' flag along with a file name like below. Only the lines on which a substitution happened are written, and the target should not be the file you are reading from:
$ cat test.txt | sed 's/aditya/aditya patawari/gw changed.txt'
You can also use redirection '>' but it can truncate the file before it is read when the same file is being read from and written to.
If you want to make changes in the file in-place then you can use the '-i' option:
$ sed -i 's/aditya/aditya patawari/g' test.txt
If you have more than one change to make then you can join sed commands using -e option.
$ cat test.txt | sed -e 's/aditya/aditya patawari/g' -e 's/patawari/linux/g'
If you have a lot of changes to make then specifying everything on the command line gets messy. Use a sed script in that case:
$ cat script.sed
s/e/3/g
s/E/3/g
s/o/0/g
s/O/0/g
s/l/1/g
s/L/1/g
$ cat test.txt | sed -f script.sed
Above will change e,o,l to 3,0,1 respectively.
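Here is a short sketch that puts the first-occurrence, global and chained substitutions side by side. The input strings are the ones used in the examples above plus a made-up repeated word.

```shell
#!/bin/sh
line="aditya's blog"

# Without 'g' only the first occurrence on a line is replaced.
first=$(echo "aditya aditya" | sed 's/aditya/X/')
# With 'g' every occurrence is replaced.
global=$(echo "aditya aditya" | sed 's/aditya/X/g')
# Chained -e expressions run in order, each on the result of the previous one.
chained=$(echo "$line" | sed -e 's/aditya/aditya patawari/g' -e 's/patawari/linux/g')

echo "$first"
echo "$global"
echo "$chained"
```

The chained case shows that the second expression sees the output of the first: "patawari" is inserted and then immediately rewritten to "linux".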

This does not even cover the basics. Grep and Sed can do much more. You don't have to learn everything now but you should keep in mind that they are very powerful, yet simple tools at your disposal.

Wednesday, August 29, 2012

When Is A Server In Production?

As a Systems Engineer/Administrator you will often have calls from people (or bots) telling that a particular app/server is not working as expected. Many times you are going to scratch your head and try to recall "when was this god damn thing put in production?". I had similar moments several times and I have thought of compiling a list for a server/app to be called "in production".

Logging
The app should produce informative logs but should not run in verbose/debug mode under normal circumstances. Logrotate should be in place. A 10 GB log file is mostly useless. Logrotate can create really nice, timestamped log files and will retain them for the last X number of days.
Automation
No manual intervention should be there on the server. Use Puppet or Chef or anything else but DO NOT touch your production servers manually. Kill anyone who tries to do so.
Backups
Everything that can be backed up and is worth a backup, should be backed up. Also make sure that you have tested a restore strategy. No backup is of any use if you cannot restore from it.
Functionality Checks
There is an appropriate functionality check in place. Please don't just check if the process is alive. I have seen several instances when process is up but not serving the requests. Do an end-to-end functionality check. Nagios is an easy to use tool for monitoring individual apps as well as a cluster.
App Owner
There should be a primary and a secondary app-owner of each app. At any given point of time either you should have the capability and authority to debug and revert the app without screwing the rest of the functionality or you should have contact of those who can do the same. Put these people on your speed dial.
Access for Troubleshooting
Developer owners should have easy and instant access to logs. Make sure that they do not have to go through an entire chain of command just to get access for troubleshooting. Make sure that they have a user account on the server or that a temporary account with sufficient permissions can be provisioned at very short notice.
Redundancy
Redundancy and failover should be there and should be well tested. Always have two of each (servers/app instances) and make sure that in the scenarios where one blows out, the other one can take the load. Play Netflix Chaos Monkey, a game where a script gets into your infra and randomly starts killing stuff. It is a great way to test the resilience of your infrastructure.
Security
Secure your machines. Make sure that firewall is always running and is restrictive. The users who left the project or organization, no longer should exist in your servers. Any directory with permissions 777 should be deleted at the expense of the person who set this permission.

I can keep this rant going for a while but I think these are the bare minimum things that should be present for any production environment, be it a huge multinational company or a small startup. These things are not difficult to do but they tend to be overlooked, assumed obvious, or put on the to-do list and never get done.

Sunday, July 29, 2012

Creating A Self Signed Certificate

This is a very small tutorial for creating self signed certificates or creating a Certificate Signing Request (CSR) for forwarding it to a CA like Thawte or Verisign. Self signed certs are good for testing stuff like https. Here is how you can generate a self signed cert:

Step 1. Create a private key
We'll create a 2048 bit RSA key. You need to secure this key and make sure that it does not fall in wrong hands.


$ openssl genrsa -out my_server.key 2048



Generating RSA private key, 2048 bit long modulus

............................................................+++

......+++

e is 65537 (0x10001)

Step 2. Create a Certificate Signing Request (CSR)

If you are going to buy a certificate signed from a CA then this CSR file is what you need to send. You need to fill in some details, technically X.509 attributes. "Common Name (eg, your name or your server's hostname)" needs to be the fqdn of your server for which you are getting the cert signed.

$ openssl req -new -key my_server.key -out my_server.csr




Country Name (2 letter code) [XX]:IN

State or Province Name (full name) []:Karnataka

Locality Name (eg, city) [Default City]:Bangalore

Organization Name (eg, company) [Default Company Ltd]:My Company

Organizational Unit Name (eg, section) []:My Team

Common Name (eg, your name or your server's hostname) []:adityapatawari.com

Email Address []:your-email@adityapatawari.com

A CSR will be generated after this step which can be shipped to any CA or proceed to next step to sign it yourself.

Step 3. Create a self signed certificate

Let us create a cert with 30 days validity. Note that this will throw up warnings in your browsers and curl may refuse to download anything from such websites unless you explicitly ask to ignore the cert.

$ openssl x509 -req -days 30 -in my_server.csr -signkey my_server.key -out my_server.crt



Signature ok

There! You have your cert. Now go play with it :)
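For throwaway test certs the three steps can also be collapsed into one non-interactive command. The sketch below is an alternative to the walkthrough above, not a replacement for it: it works in a scratch directory, leaves the key unencrypted, and uses example.test as a placeholder hostname.

```shell
#!/bin/sh
# Work in a scratch directory so no existing key or cert is overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Key and self-signed cert in one go. -nodes leaves the key unencrypted and
# -subj answers the prompts up front (example.test is a placeholder).
openssl req -x509 -newkey rsa:2048 -nodes \
    -keyout my_server.key -out my_server.crt \
    -days 30 -subj "/C=IN/ST=Karnataka/L=Bangalore/O=My Company/CN=example.test" \
    2>/dev/null

# Inspect the subject and validity dates of the generated cert.
openssl x509 -in my_server.crt -noout -subject -dates
```

The last command is also handy for checking a cert produced by the step-by-step method, since it shows the Common Name and expiry without opening a browser.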

Monday, May 28, 2012

Hadoop: "Format aborted" Error Message

I have recently started playing around with hadoop. I went through a bunch of tutorials and docs and installed it on a Centos 6 box. When I tried to start the namenode, it gave me an error informing that my namenode dir is not formatted. Fair enough but whenever I tried to format it, it used to get aborted. I checked all the configs, user, group, permissions and what not. I read and reread the docs to figure out if I am missing anything but no luck. Every time I got the following error:


-bash-4.1$ hadoop namenode -format

12/05/28 07:33:42 INFO namenode.NameNode: STARTUP_MSG:

/************************************************************

STARTUP_MSG: Starting NameNode

STARTUP_MSG:   host = hadoop1.staging.example.com/10.10.54.143

STARTUP_MSG:   args = [-format]

STARTUP_MSG:   version = 0.20.2-cdh3u4

STARTUP_MSG:   build = file:///data/1/tmp/topdir/BUILD/hadoop-0.20.2-cdh3u4 -r 214dd731e3bdb687cb55988d3f47dd9e248c5690; compiled by 'root' on Mon May  7 14:01:59 PDT 2012

************************************************************/

Re-format filesystem in /data/namenode ? (Y or N) y

Format aborted in /data/namenode

12/05/28 07:33:46 INFO namenode.NameNode: SHUTDOWN_MSG:

/************************************************************

SHUTDOWN_MSG: Shutting down NameNode at hadoop1.staging.example.com/10.10.54.143

************************************************************/

There were no logs or messages to analyze. Everything looked in order, so what was the issue?

Turned out that when the CLI asked me "Re-format filesystem in /data/namenode ? (Y or N)" I was supposed to hit upper case Y and not y. I found this quite bad and un-intuitive on developers' part that they did not implement case-insensitivity or gave out a message about hitting upper case when a lower case was entered. I just hope that this post might save someone some time.

Friday, May 25, 2012

How To Downgrade Or Reinstall RPM Package

I had this issue yesterday when I was playing with Fedora 17 Beta. Here is what happened:
1. I installed Fedora 17 Beta and did a yum update.
2. I installed some common things like vim and tree.
3. I tried to install rpm-build and rpmdevtools.

As soon as I did the third step, yum spit out an error saying that the version of the "rpm" package I have is newer than what is required. Now there is a problem. Had it been any other package, I could simply have uninstalled the newer version by doing a yum erase and installed the required version, but what do I do now? If I uninstall the rpm package then how will I install the rpm package again? Yum itself uses rpm in the backend. I wasn't able to find any "force" flag for yum.

A simple solution to the problem above is to use the rpm command instead of yum. Go to any of the mirrors and download the rpm package. Once you have the package, use the "--force" flag and install it via the rpm command.

rpm -ivh --force rpm-4.9.1.3-6.fc17.x86_64.rpm

The trick worked well and I was able to resume work.

Wednesday, May 2, 2012

Logrotate: The Most Basic Log Management Tool [Examples]

Logrotate is the default and easiest log management tool around. It is shipped by default with most of the major Linux distributions. Logrotate can help you to rotate logs (in other words, it can create a separate log file per day/week/month/year or on the basis of size of log file). It can compress the older log files. It can run custom scripts after rotation. It can rename the log to reflect the date.

Logrotate scripts go in /etc/logrotate.d/. Let us see some examples to understand it better. Here we'll rotate /var/log/anyapp.log

1. Rotate logs daily:
$ cat /etc/logrotate.d/anyapp


/var/log/anyapp.log {
        daily
        rotate 7
}

The logrotate script above will rotate /var/log/anyapp.log everyday and it'll keep the last 7 rotated log files. Instead of daily you can use monthly or weekly also.

2. Compress the rotated logs:
$ cat /etc/logrotate.d/anyapp


/var/log/anyapp.log {
        daily
        rotate 7
        compress
}

Now you'll find that logrotate is also compressing the rotated files. This is really a big life saver if you want to save some disk space which is a very common use case specially in VPS or cloud environment.
By default logrotate does a gzip compression. You can alter this behavior by using compresscmd. For example "compresscmd /bin/bzip2" will get you bzip2 compression.

3. Compress in the next cycle:
$ cat /etc/logrotate.d/anyapp


/var/log/anyapp.log {
        daily
        rotate 7
        compress
        delaycompress
}

This is useful in case it is not possible to immediately compress the file. This happens when the process keeps on writing to the old file even after the rotation. If the last line sounded strange to you then you might want to read about inodes. Also note that "delaycompress" will work only if "compress" is included in the script.

4. Compressing the copy of the log:


$ cat /etc/logrotate.d/anyapp

/var/log/anyapp.log {
        daily
        rotate 7
        compress
        delaycompress
        copytruncate
}

Copytruncate comes handy in the situation where process writes to the inode of the log and rotating the log might cause process to go defunct or stop logging or a bunch of other issues. Copytruncate copies the log and the further processing is done on the copy. It also truncates the original file to zero bytes. Therefore the inode of the file is unchanged and process keeps on writing to the log file as if nothing has happened.
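What copytruncate does can be imitated by hand with cp and a truncating redirect. The sketch below is only an illustration of the idea on a temporary file, not how logrotate is implemented internally, and it shows that the inode number of the log stays the same.

```shell
#!/bin/sh
# Temporary stand-in for a log file that some process keeps open.
log=$(mktemp)
echo "some log line" > "$log"

inode_before=$(ls -i "$log" | awk '{print $1}')

cp "$log" "$log.1"   # copy the contents aside; this is the "rotated" file
: > "$log"           # truncate the original in place, keeping the same inode

inode_after=$(ls -i "$log" | awk '{print $1}')
size_after=$(wc -c < "$log" | tr -d ' ')
rotated=$(cat "$log.1")

echo "inode before: $inode_before, after: $inode_after, size now: $size_after"
rm -f "$log" "$log.1"
```

Anything the process writes between the copy and the truncate is lost, which is the usual trade-off to keep in mind with copytruncate.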

5. Don't rotate empty log and don't give error if there is no log:


$ cat /etc/logrotate.d/anyapp

/var/log/anyapp.log {
        daily
        rotate 7
        compress
        delaycompress
        copytruncate
        notifempty
        missingok
}

Self explanatory. Both "notifempty" and "missingok" have opposite twins named "ifempty" and "nomissingok" which are the defaults for logrotate.

6. Execute custom script before and/or after logrotation:


$ cat /etc/logrotate.d/anyapp

/var/log/anyapp.log {
        daily
        rotate 7
        prerotate
            /bin/myprescript.sh
        endscript
        postrotate
            /bin/mypostscript.sh
        endscript
}

You can run multiple scripts/commands as long as they are in between (pre|post)rotate and endscript. I have removed some of the parameters from the script to maintain readability.

I have just scratched the surface of logrotate. In practice it is capable of much more. You should check out logrotate's man page for more options.

Monday, March 26, 2012

Tcpdump: Packet Filtering And Analysis with Examples

Below are some examples which will help you to understand packet filtering better. Again, a word of warning, before performing packet filtering, make sure that you do not run it in promiscuous mode and you have permissions from your employer or you run it in your own network, better if you could create your own network using virtual machines. Make sure that it is legal in your country.

Let us start.

1. Capture everything coming to your machine and write it to dump.pcap:
tcpdump -w dump.pcap

2. Capture only tcp packets and write it to dump.pcap:
tcpdump tcp -w dump.pcap
You can also use udp in place of tcp to capture udp packets

3. Capture all packets to or from port 8080 and write it to dump.pcap:
tcpdump port 8080 -w dump.pcap
You can use dst port and src port in place of port to specify the filter for destination and source port only.

4. Capture all udp packets with destination port 53 and write it to dump.pcap:
tcpdump udp and dst port 53 -w dump.pcap
Use and to connect two or more types of filters

5. Capture all the packets from interface eth0 only and write it to dump.pcap:
tcpdump -i eth0 -w dump.pcap

6. Capture all the packets coming from 192.168.1.1:
tcpdump src 192.168.1.1 -w dump.pcap

7. Capture all the packets coming from or going to 192.168.1.1:
tcpdump host 192.168.1.1 -w dump.pcap

8. Capture tcp packets from interface wlan0 going to 192.168.1.2 at port 80:
tcpdump -i wlan0 -w dump.pcap tcp and dst 192.168.1.2 and dst port 80

9. Change the size of packet capture:
tcpdump -s 0 -w dump.pcap
Here -s signifies the snap length. You can specify the maximum length of packets to capture. Bigger packets will be captured in truncated manner. Do not specify the highest limit when you don't have to. Taking larger snapshots both increases the amount of time it takes to process packets and, effectively, decreases the amount of packet buffering. Maximum limit as of now is 65535 with 0 representing the same.

10. Capture only 500 packets:
tcpdump -c 500 -w dump.pcap

11. Capture all the packets except coming from or going to 192.168.1.1:
tcpdump not host 192.168.1.1

12. Capture all the packets coming from and going to entire subnet:
tcpdump net 192.168.1.0/24

Next, you can use wireshark to analyse the packets. I wrote a post sometime ago, about wireshark which might help you.

Saturday, February 25, 2012

Scaling Puppet With Apache And mod_passenger

As you keep on adding more and more machines to Puppet, it tends to get slower. A major reason for this is that Puppet uses Webrick by default which it ships with. While webrick is good for testing and small scale deployments, it performs poorly as the number of machines increases. So a good alternative is to use Apache or Nginx in combination with mod_passenger or Unicorn. I'll show you how to use Apache + mod_passenger in a very concise way. I think using other combinations should be equally easy.

Install httpd and mod_passenger into your puppetmaster box. Use Phusion Passenger RPM Repository for mod_passenger if it is not available for your distribution.

# yum install mod_passenger httpd

Now you have to configure passenger. Include the passenger module in the apache config and set various params. To keep the configs clean, I recommend creating a separate file and putting passenger related things there. My passenger config looks like this.

Finally let us create a virtual host for the puppet master. Remember that apache not only has to serve catalogs but also has to take care of SSL. So you need to point apache to the certs and the CA created by the puppet master. You might need to install mod_ssl for turning on the SSLEngine. You can check out my apache virtual host config here. RequestHeader(s) are used to set the certificate verification result as an environment variable.

You can tweak around a bit, specially with passenger config to suit your infra needs. If at all anything is not clear please ask me in comments.

Saturday, February 18, 2012

Puppet And Common Errors

Installing Puppet can be a nightmare at times especially if you are doing it for the first time. Error messages are not always obvious and would require some experience to understand. So this is my attempt to explain the errors and suggest the solutions.

Needless to say that step one would always be to ensure that the names are resolving and the puppet client and master can communicate. Also make sure that port 8140 is white listed.

Error 1: err: Could not request certificate: getaddrinfo: Name or service not known

Probable Solution: Puppet client is not able to reach the puppet master. This usually happens when you are setting up a new environment and puppet master's name is not resolvable. If you can, put a relevant entry in your DNS and add a server variable in [agent] section in puppet.conf. Alternatively you can use /etc/hosts to point the client to the master but you'll have to add appropriate entries on the /etc/hosts of both the puppet master and client.

Error 2: Starting puppetmaster: Could not prepare for execution: Could not find a default provider for user

Probable Solution: This happens because of SELinux restrictions. You can work around this by running "setenforce 0", which puts SELinux into permissive mode. This is required for CA creation only, so you can switch back with "setenforce 1" after the puppet master creates the CA successfully.

Error 3: err: Could not request certificate: Retrieved certificate does not match private key; please remove certificate from server and regenerate it with the current key

Probable Solution: Looks like your certificates have gone bad. You should remove /var/lib/puppet/ssl directory and request for new certs signed by puppet master.

Error 4: err: Could not retrieve catalog from remote server: hostname was not match with the server certificate

Probable Solution: This may happen if you are referring to the puppet master by a wrong name. In other words, the CA is not built to use this name. You can check out the correct CA name in the file /var/lib/puppet/ssl/ca/inventory.txt. You should put this name in the [agent] section assigned to server variable.

Error 5: err: Could not retrieve catalog from remote server: Connection refused - connect(2)

Probable Solution: This is happening because your puppet client is not able to connect to puppet master. One reason might be firewall which is rejecting the packets and the other reason might be that puppet master has died. So you either need to relax your firewall or make sure that your puppet master is always up and running. You may want to use daemontools or god or a similar application.

Error 6: Exiting; no certificate found and waitforcert is disabled

Probable Solution: This usually happens when a new node is introduced in the infrastructure. The issue is that this node does not have the certificate yet and since the "--waitforcert" flag was not enabled, it exited immediately. If your puppet master has autosign enabled then just add the flag "--waitforcert X" with X replaced with a time in seconds like 60. If autosign is not enabled then you have to sign the cert for the client manually at your puppet master.

I'll add more as I encounter them. Please let me know in comments if I am wrong anywhere. Have fun with Puppet :)

Wednesday, January 11, 2012

Strace To Debug And Trace Linux System Calls

A few days ago one of the applications on my server was constantly crashing but because of the God monitoring framework, it kept coming back to life. The only thing which was changing was the pid of the process. So we ran strace to check out what system calls were being executed by the process. Assuming that the crashing process was a python script:


ps aux | grep python

strace -p <pid_from_above>

This will show you all the system calls being executed, along with the signals the process receives. In my case it was a SIGKILL which was killing the process. Actually God itself was sending it since the process was detaching from it and trying to run as a daemon.

You can use strace for the following use cases:

To check the system calls done by a command. This is helpful to know what all libraries the binary is trying to access.
strace <command>
Run strace echo hello for fun and check out the output.
You can capture the output of strace to a file by passing -o flag and then use grep for analysis.
strace -o output.txt ping 8.8.8.8
Another use case for strace is when any of your application is running unexpectedly slow. Just pass -c flag to strace and you'll get statistics of all the system calls executed. You can also pass -p with -c to supply a pid.
strace -c ping 8.8.8.8

I recommend you to read Solutions for tracing UNIX applications at IBM Developer Works for a more detailed tutorial.