Tuesday, September 10, 2013

Joining Wonga

After 5 years with ING, I decided to change the course of my professional journey and join Wonga, an UK Startups behind cool technology in micro lending business.  Myself and my family will move to Dublin Ireland where I will work in Wonga Dublin Technology Center.

Wednesday, February 20, 2013

Making chef-client safer

One of main issues we are facing when using chef in our environment is that a configuration file generated by either chef template, cookbook_file or file may exists in incomplete state during short period of time when chef-client modifying it. Processes accessing the configuration file during this period may fail strangely leaving no clear trace.

Let look at a very simple hypothetical example. We create chef template that generates /etc/hosts using data specified in role. At one time we add more machines into this /etc/hosts. What happens behind scene is that chef-client create new temporary file with required content, compare checksum of it with one of actual /etc/hosts and overwrite the /etc/hosts if it is not equals to the recent generated temporary file. During period of overwriting, any processes may see incomplete /etc/hosts, thus may not be able to resolve a hostname even though it already exists in the original file.

Luckily there is a fix for this problem. We can monkey patch chef template, cookbook_file and file providers to follow the well known pattern, create and write to temporary file then use File.rename to rename the temporary file to the final file.

The ruby File.rename uses syscall rename, which guarantees that any processes accessing the configuration file at any point of time always see either complete old or complete new version of the file.

The File.rename requires both old name and new name being in the same mounted filesystem. So oneway to make sure that File.rename will success is to create temporary file in the same directory as one of an overwritten file, which can archive easily by passing the directory of the overwritten file as tmpdir to Tempfile.new(basename, [tmpdir = Dir.tmpdir], options).

Sunday, June 24, 2012

Deterministic Chef template

It is very common that in chef recipe, we generate configuration file using template. Let look a simple example

template "/etc/hosts" do
source "hosts.erb"
owner "root"
group "root"
mode 0644
end

#hosts.erb

<% if @node[:hosts][:entries] %>
<% @node[:hosts][:entries].each do |ip, name| %>
<%= ip %> <%= name %>
<% end %>
<% end %>


In which we use fill data in form of hash structure with key as ip and value as hostname to the template.

The caveat here is that by using hash data structure we can potentially get different /etc/hosts each time we run chef client even though there is no change of data. This is due to no order guarantee of elements in hash data structure.

It does matter or not depending on the context. In case of /etc/hosts the order of host in that file may not be important. But a change notification triggers other action like start/stop other services, it can create un-wanted outcome. In addition there is some sort of compliance file monitoring tool like tripwire in place, it may create annoying alarms.

Anyway we can simply avoid it by create more consistent template in case of /etc/hosts, rewrite the template as following

#hosts.erb

<% if @node[:hosts][:entries] %>
<% @node[:hosts][:entries].keys.sort.each do |ip| %>
<% name = @node[:hosts][:entries][ip] %>
<%= ip %> <%= name %>
<% end %>
<% end %>

Sunday, April 8, 2012

Dump backtrace of all threads in ruby

I have created a few lines of code that allow me to dump stacktrace/backtrace of all threads of a running ruby process. If ruby support Thread#backtrace then it will print backtrace of all threads otherwise it will print of current running thread.
To use it first create a file ruby_backtrace.rb with the following content
require 'pp'

def backtrace_for_all_threads(signame)
  File.open("/tmp/ruby_backtrace_#{Process.pid}.txt","a") do |f|
      f.puts "--- got signal #{signame}, dump backtrace for all threads at #{Time.now}"
      if Thread.current.respond_to?(:backtrace)
        Thread.list.each do |t|
          f.puts t.inspect
          PP.pp(t.backtrace.delete_if {|frame| frame =~ /^#{File.expand_path(__FILE__)}/},
               f) # remove frames resulting from calling this method
        end
      else
          PP.pp(caller.delete_if {|frame| frame =~ /^#{File.expand_path(__FILE__)}/},
               f) # remove frames resulting from calling this method
      end
  end
end

Signal.trap(29) do
  backtrace_for_all_threads("INFO")
end
Then require this file to your ruby script you want to inspect e.g. t2.rb
require 'thread'
require './ruby_backtrace'

def foo
   bar
end

def bar
   sleep 100
end

thread1 = Thread.new do
   foo
end

thread2 = Thread.new do
   sleep 100
end

thread1.join
thread2.join
Finally run the script, send INFO signal to it and look at file ruby_backtrace_pid.txt, where pid is process id
$ ruby t2.rb &
[2] 4719
$ kill -29 4719
$ kill -29 4719
$ cat /tmp/ruby_backtrace_4719.txt 
--- got signal INFO, dump backtrace for all threads at 2012-04-07 17:33:14 +0200
#
["t2.rb:21:in `call'", "t2.rb:21:in `join'", "t2.rb:21:in `
'"] # ["t2.rb:9:in `bar'", "t2.rb:5:in `foo'", "t2.rb:13:in `block in
'"] # ["t2.rb:17:in `block in
'"] --- got signal INFO, dump backtrace for all threads at 2012-04-07 17:33:15 +0200 # ["t2.rb:21:in `call'", "t2.rb:21:in `join'", "t2.rb:21:in `
'"] # ["t2.rb:9:in `bar'", "t2.rb:5:in `foo'", "t2.rb:13:in `block in
'"] # ["t2.rb:17:in `block in
'"]

Wednesday, August 10, 2011

Craft a solid chef-server

The out of box chef server installation is NOT enough to support large scale deployment . To handle hundreds of nodes we need to craft it a bit. I highlight here some configuration tweaks
1. Run multi instances with Apache or Nginx http server as front end.
chef-server-api and chef-server-webui is single process, the only way to scale them are run multi instances of them behind the front end reverse http proxy server.
Also the front end http server can be configured to perform ssl encryption making communication between nodes and chef-server-api secure. Note that chef-server-webui is also chef-client, so use URL of the reverse proxy server to communicate with chef-server-api.
2. Use ruby 1.9
chef server components are hungry to CPU, use ruby 1.9 can boost performance as much as twice comparing to ruby 1.8.
3. Couchdb and big hard disk
Avoid un necessary headache by installing latest version of Couchdb. The version that comes with OS is usually old, buggy, often crashes under high load with hundred GB database. Hard disk is cheap, keep space for Couchdb data files, log files always free around 30%.
Also learn Erlang a bit so you can look and understand crash dump file from Couchdb as well as change its configuration.
4. Solr server
Chef-solr is de facto Solr server for full text search, The default config is pretty small for any serious operation. As it is java application based on Lucene, we need increase java heap size. You also need modify the Solr config (see this tip) to make full text index run faster. Believe me , you will have to rebuild solr index more than you think and proper solr config will save days of frustrated waiting.
5. Rabbitmq-server
Follow the same advice as for Couchdb. Rabbitmq-server is also Erlang application. It need enough memory to work reliably. Give it few GB of RAM if you do not want it crash and then will have to rebuild solr index.
6. Not enough 
Distribute load by running chef-client at different deterministic time avoiding run chef-client of all nodes at the same time.
Consider running each of chef-server component on separate box even one component in many boxes. But don't do it if you have not try all above options as the more complex configuration, the more time you have to spend on it and you end up to serve the chef instead of it serves your infrastructure.
7. Put chef-server configuration in a cookbook
Create your own cookbook to automate all the changes you make on chef-server so you can recreate exact new chef-server using chef-solo in matter of minute (when needed). At the end, who will believe that you will be able to automate the infrastructure if you can not automate your own stuff.

Wednesday, May 25, 2011

Migrate configuration to Chef

One of big headache in implementation Chef or any automate configuration management tool is migration. We usually have to face with a large legacy configuration that is poorly designed and implemented, which make thing worse.
I want to share some of our own experiences in doing migration of a relatively large configuration in term of both scale and complexity.
The tools and techniques
They are
1. a decent, fast and rock solid version control system: I recommend git.
2. semantic diff tools: diff of text file is OK, but not enough, depending on format of your configuration (XML, properties, yaml, JSON), you may need to write your own semantic diff, that is able to make intelligent diff of these formats.
3. source code of these applications that use these configuration: the migration involve legacy applications, access to source code is absolute need to order to do a refactoring.
4. collaboration: with the new tool developers need to change/cleanup the code that access configuration data, operation people need to change their practices, we all need to learn how to use, what to trust, where we need to pay attention.
Shadow configuration file
In order to minimize impact existing system, at the beginning we let chef to create shadow configurations, which then are verified with existing configuration using diff tools. The shadow configuration generated by chef will replace the actual one only after we make sure that they are semantically the same from the point of view of the program using it.
In shadow configuration we can have a configuration file with different prefix/postfix or different directory. e.g. if the actual configuration file is /etc/hosts then the shadow one will be /etc/hosts.chef or /etc-chef/hosts or /root-chef/etc/hosts. I personally prefer the complete shadow directory because it is easier to check, remove when needed.
The cookbook need to have one attribute (can be name of directory where we are going to generate configuration files) saying if we are going to generate shadow or actual configuration.
Conflicting changes
In ideal world, we just has one configuration management system and one administrator who modify the system. But reality is different. There are usually more actors that make the change. So there is potential conflict. E.g. Chef modify one file and then an administrator or scripts or other configuration system don't know and also modify it. The worst case is that everyone assume that the file has the content he expect and thing get broken. These scripts that work for many years suddenly fails.
How to minimize the conflict?. Communicate well with other team members what are managed by Chef, which they should modify through Chef and not manually or by other tools. Various methods can be used e.g. a) wiki pages, b) put a clear notice as comment at the beginning of all files managed by Chef such as "This file is generated by Chef, you shall not modify it as it will be overwritten" c) notify all people each time a file is modified by Chef.
Top-down vs. bottom-up
There is basically two approaches to start the migration project 1) top-down and 2) bottom-up.
1. In top-down approach, you start with create a cookbook/recipe for one configuration item (e.g. syslog-ng, the cookbook will be fairly generic so it is usable by all environments) . Then extract specific data (e.g. ip of syslog server, source of log) from configuration file from each server and put them as attributes into a role. After finishing one configuration item, you continue with other until you have all you need in Chef.
2. With bottom-up approach, we first create one generic cookbook/recipe per server role (e.g. middleware) then you add all configuration files from servers of all environments to it. The structure of cookbook files can be like
      middleware/files/default   # files that are same for all hosts of all env.
      middleware/files/default/dev   #  files that are same for all hosts of development env.
      middleware/files/default/dev/host1   # files that are specific for host1 of evelopment env. 
      middleware/files/default/dev/host2
      middleware/files/default/qa
      middleware/files/default/pro
The recipe can be easily developed to copy these configuration files to a server depending on environment and hostname (node[:hostname]). After everything are in Chef and work well, we will start refactor and split this big generic cookbook to many smaller cookbooks, adding more roles and make thing easily to reuse. This step is kind of continual improvement process aiming at making our chef repository better and better.

In our project we have taken the first approach because it seems more naturally at first sight, however later we change to the second one, which I think have many benefits.
With the second approach, after we put all things in the Chef (can be done pretty quickly), we can announce that from now all changes has to be done through Chef. The very first result is seen immediately: a) there is single point of making change b) change is well audited and communicated as Chef repository is under version control system c)conflicting changes is minimized

Sunday, May 22, 2011

Chef - Different between method defined in recipes and libraries

There is one thing that need an attention when developing a recipe. If we create a method in recipe then that method is not available inside resource block's parameter. I will show it in the following example using Shef, the Chef interactive Ruby console.
[root@localhost gems]# shef

Ohai2u root@localhost!
chef > recipe
chef:recipe >   # here we are inside of a recipe
chef:recipe > def account(path) # create a method
chef:recipe ?>    return 'nobody' if File.dirname(path) == '/tmp'
chef:recipe ?>    return 'root'
chef:recipe ?> end
 => nil 
chef:recipe > file "/tmp/file.test" do
chef:recipe >      action :create
chef:recipe ?>     owner account(name) # try to use this method inside the block and we got a error
chef:recipe ?> end
NoMethodError: undefined method `account' for Chef::Resource::File
 from /usr/lib/ruby/gems/1.8/gems/chef-0.9.8/lib/chef/resource.rb:84:in `method_missing'
One way to workaround is to call the method outside of resource block's parameter so the method is called in the context of recipe
chef:recipe > fname = "/tmp/file.test"
 => "/tmp/file.test" 
chef:recipe > account_val = account(fname)
 => "root" 
chef:recipe > file fname do
chef:recipe >   action :create
chef:recipe ?>  owner account_val
chef:recipe ?> end
chef:recipe >
Other way is to define it as library (put it inside the directory libraries), in shef we would do like that
chef:recipe > exit 
 => :recipe
chef >  # here we are in top context, in which libraries are loaded 
chef > def account(dirname)
chef ?>    return 'nobody' if File.dirname(path) == '/tmp'
chef ?>    return 'root'
chef ?> end
 => nil 
chef > recipe
chef:recipe >   # here we are inside of a recipe
chef:recipe > file "/tmp/file.test" do
chef:recipe >    action :create
chef:recipe ?>   owner account(name) # now it is OK
chef:recipe ?> end
chef:recipe >
In that case, the method is visible in the context of resource.

Friday, April 22, 2011

Some experiences in using Opscode Chef

I am reaching the end of our Chef's implementation project, in which Chef is deployed to manage hundreds of Apache and Jboss servers (Redhat) hosting complex and dynamic banking application.
So it is worth-wise to recap some issues, problems encountered and solutions employed during the course of implementation.
Chef vs other tools
The project actually started from around Aug/Sep last year with the decision to start with Chef instead of other tools Puppet, CF Engine. I choose Chef because CF Engine is just too old while Puppet is too complex, both require to learn new language for expressing configuration and they missed GUI. Other quite subjective reason is that I know Ruby better comparing to other.
The challenges
The technical challenges include the complexity of the configuration being migrated to Chef as well its ever changing nature during the course of implementation. As Chef represents paradigm's shift to what we called infrastructure as code, the most difficult part is to persuade the team to change their habit and to adopt software development practices in their work.
Refactoring
Our environment is complicated and legacy. It was created over years by different guys manually with a bundle of shell scripts, that are far from good in quality from software engineering perspective (consistency, less duplication, good naming). So the implementation also mean refactoring the environment to something more viable in order to automate.
Cookbooks, roles are developed in parallel with renaming of configuration files/directories, cleaning up obsolete stuffs while adding support for new developed applications, environments, at the same time making sure not to break any things.
Single vs multi instances
At the beginning of the project there is no environment supported in Chef so there is two options a) maintain separate Chef instance per environment b) embed environment's information into role. As we have too many environments (development, preproduction, quality assurance, training, laboratory, DRS, production), it would be less work to select second option.
Cookbooks are developed and shared between all environment while roles are created per environment. I use e.g. the following naming convention for role: dev_apache_internet_os to denote role containing attributes for OS related cookbooks of Apache server in Internet zone.
Roles as Ruby code
We have nearly hundred of roles, a server is assigned to at least 2 roles one of OS and other of application specific. That is for separate the concern. OS role will has attributes more less related to OS e.g. pam, ssh, hosts, dns, ntp, postfix, route while application specific are e.g. Apache, Java, Tomcat, Jboss, log4j, etc.
To make it easy to manage; a subdirectory is created for each environment and roles are placed in it. The role format is Plain Old Ruby instead of JSON. This is because comparing to JSON format, the Ruby one is easier to spot error saving a lot of time. When we have hundred lines in a Role's file, it is quite possible that we make some typing errors.
The other reason is that we can do some sort of programming in Role file e.g. it is more convenient using a loop to create mod_jk config that route request to dozen of servers.
One layer above roles
As the number if roles grows, I have seen that many roles of different environments share common attributes. At OS level e.g. we use same DNS and NTP for both production and non production just in case of DRS there is different. Such examples are endless. These attributes are mostly not same in all environments, they are just same in some so it is not wise to put them in attributes's file of relevant cookbooks. Therefore we have created additional layer to support reduce such kind of duplication. These attributes and their values are kept together with applicable environments e.g. in separate file
set_config_item(:item=>"dns",:env=>["dev","pre","qa"],:value=>["192.168.0.1","192.168.0.2"])
and in the Role' file, we fetch the relevant attributes value. e.g.
"dns"=>get_config_item(:item=>"dns",:env=>"dev")
Call Attribute#to_hash in recipe
Cookbooks are commonly used in all environments. Beside infrastructure, we also create cookbooks that maintain our application specific setting such as URL of external partner systems that we communicate with. The fundamental structure are instances and application modules. Many different application modules can be deployed in single instance and the same application module can be deployed in different instances to support of different versions of the same module and dedicate instance per customer delivery channel.
Recipe of a cookbook get data from a role to define resources (file, directory, permission, etc.). As we want make role simple, easy to create and modify, I put more logic in recipe. Some recipes are real programming stuff that need to iterate over structure of a role to get desire data. The autovivify feature of attribute sometime causes problem, in order to avoid it we have to call to_hash for attribute before doing any iteration over it.
Avoid putting big binary in a cookbook
We initially tried to put all files required by a recipe in its cookbook. It turn out to be a bad idea to keep e.g. 200 MB of Jboss installation in Couchdb of Chef. Both knife and Chef-client runs very slow and un reliably. At the end we decided to keep these big binary (more than 20 MB) outside Chef cookbook and setup an Apache for handling these files. The recipe then has to be written to handle checksum, downloading (we use wget) and installation.
We do not have to do it in case of software packages available as rpm, because chef package resource supports their installation.
Make sure that there is no non-ascii character in /etc/passwd,group
chef-client use ohai to parse certain OS configuration files e.g. /etc/password to JSON, if these files contain non-ascii characters, it can causes a problem because JSON lib can not cope with that. So it is better to remove non-ascii from OS configuration files when migrating existing machine to chef.
Disable SELinux
SELinux se store security context per files & directories. When chef-client modify configuration files , it may not retain that information making some daemon not working properly. Chef does not support SELinux so we need to disable it to avoid problem.
Start multi chef-server-api instances for better performance
A default chef-server can not support hundreds of clients, so we installed front end Apache that handles SSL encryption for chef-client as well as balances request to 4 instances of chef-server-api, which we start using option -c 4.
run chef-client as root in crontab
We run chef-client on our non production environments, this is because we want to avoid possible problem with long running ruby process and to reduce the chance that chef-client at many machines run at one moment causing high load to chef-server.
Using crontab we can easily specify time when they will run for each group of machines or even machine. Time to run chef-client can be further generated automatically using hash of machine name to achieve a fair distribution of load.
For user, group management chef-client uses unix command lines, so chef-client must run as root with appropriate path to locations of useradd, groupadd, … (e.g. /usr/sbin,/usr/bin,/bin, /sbin).
The flow of change
The chef-client does not run automatically in production one. This is because in production we want more control as well as due to our change management policy. It is also safer taking into consideration that modification of cookbook can potentially impact production.
The flow of change in production (after cookbook being modified and tested in other environments) involes 1) modify relevant production role, 2) take one production machine from service, 3) execute chef-client on that machine, 4) verify that it work as per expectation, 5) execute chef-client in all remain servers and put all into service.
To execute the chef-client command on a group of machines, we can use sort of Command Control system (e.g. rundesk, control tier, ..) or simple knife.
Hacking, Notification and improvement
To increase safety in the context of missing noop option in chef-client, I have create a monkey patch for certain providers so when chef-client modify a file it notify us by sending output of diff between new and old file by e-mail. The custom notification also send alert to certain group of people depending of nature of change and affected environment.
Beside that, we have create new resources and change behavior of fews existing one to suit more our need. The code base of our cookbook, roles is now reaching 20K line of Ruby code and we continually adding more thing as well improve it.

Tuesday, February 15, 2011

chroot - Change Root

I have recently closed a year old ticket that has some thing to do with chroot. For security reason, we use chroot (in modsecurity) to restrict our Apache process to access only to a desired directory tree. As per chroot document, a program that is re-rooted to another directory cannot access or name files outside that directory, and the directory is called a "chroot jail".
We make sure that chroot is called after Apache process complete it' initialization in order to not break anything, Because otherwise Apache will not be able to access needed share lib, log files, pid file located in various system directories.
However there is a wrong access time the log produced by our Apache. It is always GMT not local time as we want. We have opened ticket with vendor, searching over internet, looking at source code but could not figure out why. The worse thing is that, for first few requests , the access time is correct (local time) but then it gets change to GMT.
Yesterday I have found the reason. I remember that when I ran the strace with the Apache process being chrooted, I saw Apache try to open some files but could not find it. We see a lot of file not found when running strace because various libraries intend to open some file and if it is not found then try others. But in that case the file is /etc/localtime. So it turns out that when logging Apache use apr lib, which call gmtime and mktime, which need to access /etc/localtime. So missing this file in "chroot jail" causes a problem with access time. Without file /etc/localtime gmtime and mktime consider that the machine is in GMT time zone.