Friday, April 22, 2011

Some experiences in using Opscode Chef

I am reaching the end of our Chef implementation project, in which Chef is deployed to manage hundreds of Apache and JBoss servers (Red Hat) hosting a complex and dynamic banking application.
So it is worthwhile to recap the issues and problems encountered, and the solutions employed, during the course of the implementation.
Chef vs other tools
The project actually started around Aug/Sep last year with the decision to go with Chef instead of other tools such as Puppet or CFEngine. I chose Chef because CFEngine is just too old while Puppet is too complex; both require learning a new language for expressing configuration, and both lack a GUI. Another, quite subjective, reason is that I know Ruby better than the other languages.
The challenges
The technical challenges include the complexity of the configuration being migrated to Chef as well as its ever-changing nature during the course of the implementation. As Chef represents a paradigm shift to what is called infrastructure as code, the most difficult part is persuading the team to change their habits and adopt software development practices in their work.
Refactoring
Our environment is complicated and full of legacy. It was built up manually over the years by different people with a bundle of shell scripts that are far from good quality from a software engineering perspective (consistency, minimal duplication, good naming). So the implementation also meant refactoring the environment into something that is viable to automate.
Cookbooks and roles were developed in parallel with renaming configuration files and directories and cleaning up obsolete material, while adding support for newly developed applications and environments, and at the same time making sure not to break anything.
Single vs multi instances
At the beginning of the project Chef had no support for environments, so there were two options: a) maintain a separate Chef instance per environment, or b) embed the environment's information into the role. As we have many environments (development, preproduction, quality assurance, training, laboratory, DRS, production), the second option was less work.
Cookbooks are developed once and shared between all environments, while roles are created per environment. For role names I use a naming convention such as dev_apache_internet_os, which denotes the role containing attributes for the OS-related cookbooks of an Apache server in the Internet zone.
Roles as Ruby code
We have nearly a hundred roles. A server is assigned at least two roles, one OS-specific and one application-specific, to separate the concerns. An OS role has attributes more or less related to the OS, e.g. pam, ssh, hosts, dns, ntp, postfix, route, while application-specific roles cover e.g. Apache, Java, Tomcat, JBoss, log4j, etc.
To make this easy to manage, a subdirectory is created for each environment and its roles are placed in it. The roles are written in plain old Ruby instead of JSON, because errors are easier to spot in the Ruby format, which saves a lot of time. When a role file has a hundred lines, typing errors are quite likely.
The other reason is that we can do some programming in the role file; e.g. it is more convenient to use a loop to create a mod_jk configuration that routes requests to a dozen servers.
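To illustrate the loop idea, here is a minimal sketch in plain Ruby. The host names, the attribute layout under "mod_jk" and the role name in the trailing comment are all hypothetical, chosen only for this example; port 8009 is the usual AJP default. Our real roles are structured differently.

```ruby
# Hypothetical back-end host names; the real list is not given in this post.
backends = (1..12).map { |i| format("jboss%02d.example.com", i) }

# Build one mod_jk worker entry per back-end server with a loop
# instead of typing a dozen nearly identical blocks by hand.
workers = {}
backends.each_with_index do |host, i|
  workers["worker#{i + 1}"] = { "host" => host, "port" => 8009, "type" => "ajp13" }
end

# The attribute layout below is an assumption made for this sketch.
mod_jk_attributes = {
  "mod_jk" => {
    "workers"  => workers,
    "balancer" => { "name" => "loadbalancer", "members" => workers.keys }
  }
}

# In a Ruby role file the hash would then be handed to the role DSL, e.g.:
#   name "dev_apache_internet_app"          # hypothetical role name
#   default_attributes(mod_jk_attributes)
```

The same structure written out by hand in JSON would be twelve copies of the same block, which is exactly where typing errors creep in.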
One layer above roles
As the number of roles grew, I saw that many roles in different environments share common attributes. At the OS level, for example, we use the same DNS and NTP servers for both production and non-production; only DRS is different. Such examples are endless. These attributes are mostly not the same in all environments, only in some of them, so it is not wise to put them in the attributes file of the relevant cookbook. Therefore we created an additional layer to reduce this kind of duplication. The attributes and their values are kept together with the environments they apply to, e.g. in a separate file:
set_config_item(:item=>"dns",:env=>["dev","pre","qa"],:value=>["192.168.0.1","192.168.0.2"])
and in the role file we fetch the relevant attribute value, e.g.
"dns"=>get_config_item(:item=>"dns",:env=>"dev")
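One possible way to implement these two helpers is sketched below. Only the names set_config_item and get_config_item and their arguments come from the calls above; the CONFIG_ITEMS registry and the choice to raise on an unknown item or environment are assumptions of this sketch, not a description of our actual code.

```ruby
# Registry mapping item name -> environment -> value (name is hypothetical).
CONFIG_ITEMS = Hash.new { |hash, item| hash[item] = {} }

# Register one value for every environment listed in :env.
def set_config_item(args)
  Array(args[:env]).each { |env| CONFIG_ITEMS[args[:item]][env] = args[:value] }
end

# Look up the value of an item for a single environment.
# Raises KeyError instead of returning nil, so a typo in a role file fails loudly.
def get_config_item(args)
  CONFIG_ITEMS.fetch(args[:item]).fetch(args[:env])
end

set_config_item(:item => "dns", :env => ["dev", "pre", "qa"],
                :value => ["192.168.0.1", "192.168.0.2"])
dns_for_dev = get_config_item(:item => "dns", :env => "dev")
```

Failing on a missing entry is a deliberate choice here: a role that silently gets nil for its DNS servers is harder to debug than one that refuses to load.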
Call Attribute#to_hash in recipe
Cookbooks are shared by all environments. Besides infrastructure, we also create cookbooks that maintain our application-specific settings, such as the URLs of the external partner systems we communicate with. The fundamental structures are instances and application modules. Many different application modules can be deployed in a single instance, and the same application module can be deployed in different instances, to support different versions of the same module and to dedicate an instance to each customer delivery channel.
A recipe in a cookbook gets data from a role to define resources (file, directory, permission, etc.). As we want to keep roles simple and easy to create and modify, I put more of the logic in the recipes. Some recipes involve real programming, because they need to iterate over the structure of a role to get the desired data. The autovivification feature of attributes sometimes causes problems; to avoid it, we call to_hash on the attribute before iterating over it.
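The effect can be shown without Chef at all. The class below is a stand-in written only for this illustration (it is not Chef's attribute class, and the instance and module names are made up); it mimics the behaviour where reading a missing key silently creates it, and shows why working on a to_hash snapshot is safer.

```ruby
# Stand-in for an autovivifying attribute object (illustration only).
class AutoVivifyingAttribute
  def initialize(data = {})
    @data = data
  end

  # Reading a missing key creates an empty nested object.
  def [](key)
    @data[key] ||= AutoVivifyingAttribute.new
  end

  def []=(key, value)
    @data[key] = value
  end

  def keys
    @data.keys
  end

  # Return a plain nested Hash without the autovivifying behaviour.
  def to_hash
    @data.each_with_object({}) do |(key, value), plain|
      plain[key] = value.is_a?(AutoVivifyingAttribute) ? value.to_hash : value
    end
  end
end

instances = AutoVivifyingAttribute.new
instances["inst1"]["modules"] = ["payments"]

# On a plain Hash snapshot, looking up a missing key returns nil
# and leaves the original data untouched.
plain = instances.to_hash
missing_in_plain = plain["inst2"]
keys_after_plain_lookup = instances.keys.dup

# The same lookup on the attribute object creates an empty "inst2" entry,
# which a later iteration would then treat as a real instance.
instances["inst2"]
keys_after_direct_lookup = instances.keys
```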
Avoid putting big binary in a cookbook
We initially tried to put all files required by a recipe in its cookbook. It turned out to be a bad idea to keep e.g. a 200 MB JBoss installation in Chef's CouchDB. Both knife and chef-client ran very slowly and unreliably. In the end we decided to keep these big binaries (more than 20 MB) outside the Chef cookbooks and set up an Apache server to serve them. The recipe then has to be written to handle checksum verification, downloading (we use wget) and installation.
We do not have to do this for software available as RPM packages, because the Chef package resource supports their installation.
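For the large archives kept outside the cookbook, the checksum part of such a recipe could look roughly like the sketch below. SHA-256 and the helper names are assumptions of this example; the idea is simply to skip the wget download when the local file already matches the expected checksum.

```ruby
require "digest/sha2"

# Return true when the file at +path+ exists and its SHA-256 digest
# equals +expected_sha256+ (a hex string).
def checksum_ok?(path, expected_sha256)
  File.file?(path) && Digest::SHA256.file(path).hexdigest == expected_sha256.downcase
end

# Decide whether the archive has to be downloaded (again):
# either it is missing or its content does not match.
def download_needed?(path, expected_sha256)
  !checksum_ok?(path, expected_sha256)
end
```

In a recipe, the download and installation steps would then be guarded by download_needed?, and the checksum would be verified once more after the download before anything is unpacked.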
Make sure that there are no non-ASCII characters in /etc/passwd and /etc/group
chef-client uses Ohai to parse certain OS configuration files, e.g. /etc/passwd, into JSON. If these files contain non-ASCII characters, this can cause a problem because the JSON library cannot cope with them. So it is better to remove non-ASCII characters from these OS configuration files when migrating an existing machine to Chef.
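A small script along the following lines can be used to find the offending lines before a machine is put under Chef. It is only a sketch: the helper name is made up, and it simply reports every line containing a byte outside the 7-bit ASCII range.

```ruby
# Return [line_number, line] pairs for every line that contains
# a byte outside the 7-bit ASCII range.
def non_ascii_lines(text)
  hits = []
  text.each_line.with_index(1) do |line, number|
    hits << [number, line.chomp] if line.bytes.any? { |byte| byte > 127 }
  end
  hits
end

# Check the two files mentioned above; unreadable files are skipped.
%w[/etc/passwd /etc/group].each do |path|
  next unless File.readable?(path)
  non_ascii_lines(File.binread(path)).each do |number, line|
    puts "#{path}:#{number}: #{line.inspect}"
  end
end
```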
Disable SELinux
SELinux stores a security context for each file and directory. When chef-client modifies configuration files, it may not retain that information, which stops some daemons from working properly. Chef does not support SELinux, so we need to disable it to avoid problems.
Start multi chef-server-api instances for better performance
A default chef-server cannot support hundreds of clients, so we installed a front-end Apache that handles SSL encryption for chef-client and balances requests across 4 instances of chef-server-api, which we start using the option -c 4.
Run chef-client as root from crontab
In our non-production environments we run chef-client from crontab rather than as a long-running process. This avoids possible problems with a long-running Ruby process and reduces the chance that chef-client runs on many machines at the same moment, causing high load on the chef-server.
Using crontab we can easily specify the time at which chef-client runs for each group of machines or even for a single machine. The run time can also be generated automatically from a hash of the machine name to achieve a fair distribution of load.
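The hashing idea can be sketched as follows. The 30-minute interval, the use of MD5 and the /usr/bin/chef-client path are assumptions of this example, and the interval is expected to divide 60 evenly.

```ruby
require "digest/md5"

# Map a host name to a stable minute offset in 0...interval.
# The same host always gets the same offset; different hosts spread out.
def cron_minute_offset(hostname, interval = 30)
  Digest::MD5.hexdigest(hostname).to_i(16) % interval
end

# Build a crontab line that runs chef-client every +interval+ minutes,
# shifted by the host-specific offset.
def chef_client_cron_line(hostname, interval = 30)
  offset = cron_minute_offset(hostname, interval)
  minutes = (offset...60).step(interval).to_a.join(",")
  "#{minutes} * * * * /usr/bin/chef-client"
end
```

Because the offset depends only on the host name, the line can be regenerated on every run without the schedule jumping around.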
For user and group management chef-client uses Unix command-line tools, so it must run as root with a PATH that includes the locations of useradd, groupadd, … (e.g. /usr/sbin, /usr/bin, /bin, /sbin).
The flow of change
chef-client does not run automatically in production. We want more control there, and our change management policy requires it. It is also safer, considering that a cookbook modification can potentially impact production.
The flow of a change in production (after the cookbook has been modified and tested in the other environments) involves 1) modifying the relevant production role, 2) taking one production machine out of service, 3) executing chef-client on that machine, 4) verifying that it works as expected, 5) executing chef-client on all remaining servers and putting all of them into service.
To execute the chef-client command on a group of machines, we can use some kind of command and control system (e.g. Rundeck, ControlTier, ..) or simply knife.
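The ordering of steps 3 to 5 can be expressed as a small driver like the one below. It is a sketch only: run_chef and verify are callables supplied by the caller (for example wrappers around knife or a command and control system), and taking machines in and out of service is deliberately left out.

```ruby
# Roll a change out to +servers+: first to one machine, then to the rest.
# +run_chef+ runs chef-client on a server; +verify+ returns true when the
# server behaves as expected. Both are supplied by the caller.
def rolling_converge(servers, run_chef:, verify:)
  first, *rest = servers
  run_chef.call(first)
  # Stop before touching the remaining servers if the first one looks wrong.
  raise "verification failed on #{first}; rollout stopped" unless verify.call(first)
  rest.each { |server| run_chef.call(server) }
  servers
end
```

The point of the structure is that a failed verification stops the rollout while only one machine has been changed.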
Hacking, notification and improvement
To increase safety given the missing no-op option in chef-client, I have created a monkey patch for certain providers so that when chef-client modifies a file, it notifies us by e-mailing the diff between the new and the old file. The custom notification also sends alerts to certain groups of people depending on the nature of the change and the affected environment.
Besides that, we have created new resources and changed the behavior of a few existing ones to better suit our needs. The code base of our cookbooks and roles is now reaching 20K lines of Ruby code, and we are continually adding more and improving it.