# Color Theme Generator

26. October 2011 17:25

I thought I had posted about this site before, but I could not find it and had to go back to a manual search and find mission to get back to it. ColorSchemeDesigner is a pretty flexible site that helps find complementary colors for those of us, like myself, that are artistically impaired.

Check it out!

Tags:

# Game / Code Kata – Krypto

5. September 2011 06:34

I used to read TheDailyWTF, not quite sure why I stopped, maybe I will start again. Anyway, I tucked this one away, Krypto and 24,  for later because the game sounded quite interesting. One of those fun math games with playing cards that is really right up my alley. Here’s how Krypto works:

While most kids reached in their pockets for coins to ante up, we’d pulled the face and joker cards out of the deck, shuffle the rest, and deal out six cards, face-up, in the middle of the table with one of the cards a few inches from the rest.

8♥   5♦   2♣   10♥   5♣       3♠

Our goal was to race to see who could make a mathematical equation using only the four basic arithmetic expressions and parenthesis.

5♣ - (8♥ * 5♦ / 10♥) + 2♣   =   3♠

The first to solve (which, almost never was me) kept the five cards as points, and play continued until we ran out of cards.

We’d also play another variant called 24, where the five cards would have to come to 24. For example...

8♥   5♦   2♣   10♥   5♣   =   24

... would become...

(5♦ + 5♣ - 8♥) * (2♣ + 10♥)   =   24

These games (which I’ve since learned are called Krypto and 24) were – and still are – a lot of fun.

Your exercise for the day: play Krypto and 24. More specifically, write a program to play the games for you.

• The input should be:
• Easy - six integers, the first five of which are between 1 and 10, and the sixth which is between 1 and 10 or is 24
• Hard - an integer array of unknown length and an integer that is a solution of the array
• The output should be:
• Easy - a string representing a single solution to the input
• Medium - a string array representing all possible solutions to the input
• Hard - a string array representing non-duplicated solutions to the input; a solution is considered duplicate if the differences are only associative or commutative

For bonus points, create a function that mimics a game; i.e., starting with 40 cards, removing 5, shuffling again, etc.

Tags:

# More Interview Questions

5. September 2011 05:56

It is a good idea to occasionally run through some interview-type questions to keep yourself up to speed even if you are not planning on interviewing anytime soon. You never know when you may need to interview, plus it just helps to reinforce the basics by making you explain them thoroughly. Here are some more interview questions that Scott Hanselman posted awhile back at New Interview Questions for Senior Software Engineers

• What is something substantive that you've done to improve as a developer in your career?
• Would you call yourself a craftsman (craftsperson) and what does that word mean to you?
• Implement a <basic data structure> using <some language> on <paper|whiteboard|notepad>.
• What is SOLID?
• Why is the Single Responsibility Principle important?
• What is Inversion of Control? How does that relate to dependency injection?
• How does a 3 tier application differ from a 2 tier one?
• Why are interfaces important?
• What is the Repository pattern? The Factory Pattern? Why are patterns important?
• What are some examples of anti-patterns?
• Who are the Gang of Four? Why should you care?
• How do the MVP, MVC, and MVVM patterns relate? When are they appropriate?
• Explain the concept of Separation of Concerns and it's pros and cons.
• Name three primary attributes of object-oriented design. Describe what they mean and why they're important.
• Describe a pattern that is NOT the Factory Pattern? How is it used and when?
• You have just been put in charge of a legacy code project with maintainability problems. What kind of things would you look to improve to get the project on a stable footing?
• Show me a portfolio of all the applications you worked on, and tell me how you contributed to design them.
• What are some alternate ways to store data other than a relational database? Why would you do that, and what are the trade-offs?
• Explain the concept of convention over configuration, and talk about an example of convention over configuration you have seen in the wild.
• Explain the differences between stateless and stateful systems, and impacts of state on parallelism.
• Discuss the differences between Mocks and Stubs/Fakes and where you might use them (answers aren't that important here, just the discussion that would ensue).
• Discuss the concept of YAGNI and explain something you did recently that adhered to this practice.
• Explain what is meant by a sandbox, why you would use one, and identify examples of sandboxes in the wild.
• Concurrency
• What's the difference between Locking and Lockless (Optimistic and Pessimistic) concurrency models?
• What kinds of problems can you hit with locking model? And a lockless model?
• What trade offs do you have for resource contention?
• What's the difference between asynchrony and concurrency?
• Are you still writing code? Do you love it?
• You've just been assigned to a project in a new technology how would you get started?
• How does the addition of Service Orientation change systems? When is it appropriate to use?
• What do you do to stay abreast of the latest technologies and tools?
• What is the difference between "set" logic, and "procedural" logic. When would you use each one and why?
• What Source Control systems have you worked with?
• What is Continuous Integration?  Have you used it and why is it important?
• Describe a software development life cycle that you've managed.
• How do you react to people criticizing your code/documents?
• Whose blogs or podcasts do you follow? Do you blog or podcast?
• What is the last programming book you read?
• Describe, in as much detail as you think is relevant, as deeply as you can, what happens when I type "cnn.com" into a browser and press "Go".
• Describe the structure and contents of a design document, or a set of design documents, for a multi-tiered web application.
• What's so great about <cool web technology of the day>?
• How can you stop your DBA from making off with a list of your users’ passwords?
• What do you do when you get stuck with a problem you can't solve?
• If your database was under a lot of strain, what are the first few things you might consider to speed it up?
• What is SQL injection?
• What's the difference between unit test and integration test?
• Tell me about 3 times you failed.
• What is Refactoring ? Have you used it and it is important? Name three common refactorings.
• You have two computers, and you want to get data from one to the other. How could you do it?
• Left to your own devices, what would you create?
• Given Time, Cost, Client satisfaction and Best Practices, how will you prioritize them for a project you are working on? Explain why.
• What's the difference between a web server, web farm and web garden? How would your web application need to change for each?
• What value do daily builds, automated testing, and peer reviews add to a project? What disadvantages are there?
• What elements of OO design are most prone to abuse? How would you mitigate that?
• What's YAGNI? Is this list of questions an example?
• Describe to me some bad code you've read or inherited lately.

Tags:

# Code Kata – Bonus

5. September 2011 05:33

This one came courtesy of Ted Neward, who got it from the University of Virginia’s programming contest way back when:

# RoboStack

Problem: You work for a manufacturing company, and they have just received their newest piece of super-modern hardware, a highly efficient assembly-line mechanized pneumatic item manipulator, also known in some circles as a "robotic arm". It is driven by a series of commands, and your job is to write the software to drive the arm. The initial test will be to have the arm move a series of blocks around.

Context: The test begins with n number of blocks, laid out sequentially next to each other, each block with a number on it. (You may safely assume that n never exceeds 25.) So, if n is 4, then the blocks are laid out (starting from 0) as:

0: 0

1: 1

2: 2

3: 3

The display output here is the block-numbered "slot", then a colon, then the block(s) that are stacked in that slot, lowest to highest in left to right order. Thus, in the following display:

0:

1:

2: 0 1 2 3

3:

The 3 block is stacked on top of the 2 block is stacked on top of the 1 block is stacked on top of the 0 block, all in slot 2. This can be shortened to the representation [0:, 1:, 2: 0 1 2 3, 3:] for conciseness.

The arm understands a number of different commands, as well as an optic sensor. (Yeah, the guys who created the arm were good enough to write code that knows how to read the number off a block, but not to actually drive the arm. Go figure.) The commands are as follows, where a and b are valid block numbers (meaning they are between 0 and n-1):

• "move a onto b" This command orders the arm to find block a, and return any blocks stacked on top of it to their original position. Do the same for block b, then stack block a on top of b.
• "move a over b" This command orders the arm to find block a, and return any blocks stacked on top of it to their original position. Then stack block a on top of the stack of blocks containing b.
• "pile a onto b" This command orders the arm to find the stack of blocks containing block b, and return any blocks stacked on top of it to their original position. Then the arm must find the stack of blocks containing block a, and take the stack of blocks starting from a on upwards (in other words, don't do anything with any blocks on top of a) and put that stack on top of block b.
• "pile a over b" This command orders the arm to find the stack of blocks containing block a and take the stack of blocks starting from a on upwards (in other words, don't do anything with any blocks on top of a) and put that stack on top of the stack of blocks containing block b (in other words, don't do anything with the stack of blocks containingb, either).
• "quit" This command tells the arm to shut down (and thus terminates the simulation).

Note that if the input command sequence accidentally offers a command where a and b are the same value, that command is illegal and should be ignored.

As an example, then, if we have 4 blocks in the state [0: 0, 1: 1, 2: 2, 3: 3], and run a "move 2 onto 3", we get [0: 0, 1: 1, 2:, 3: 3 2]. If we then run a "pile 3 over 1", we should end up with [0: 0, 1: 1 3 2, 2:, 3:]. And so on.

Input: n = 10. Run these commands:

1. move 9 onto 1
2. move 8 over 1
3. move 7 over 1
4. move 6 over 1
5. pile 8 over 6
6. pile 8 over 5
7. move 2 over 1
8. move 4 over 9
9. quit

The result should be [0: 0, 1: 1 9 2 4, 2:, 3: 3, 4:, 5: 5 8 7 6, 6:, 7:, 8:, 9:]

Challenges:

• Implement the Towers of Hanoi (or as close to it as you can get) using this system.
• Add an optimizer to the arm, in essence reading in the entire program (up to "quit"), finding shorter paths and/or different commands to achieve the same result.
• Add a visual component to the simulation, displaying the arm as it moves over each block and moves blocks around.
• Add another robotic arm, and allow commands to be given simultaneously. This will require some thought—does each arm execute a complete command before allowing the other arm to execute (which reduces the performance having two arms might offer), or can each arm act entirely independently? The two (or more) arms will probably need separate command streams, but you might try running them with one command stream just for grins. Note that deciding how to synchronized the arms so they don't conflict with one another will probably require adding some kind of synchronization instructions into the stream as well.

Tags:

# Code Kata 16–21 of 21

5. September 2011 05:31

I came across this site quite a while back and have been meaning to blog about it. The site is CodeKata, and the premise is essentially to give reasonably interesting problems that can be solved using code in order to practice and hone your skills. Not everything is a coding problem, but where I can see these coming in handy are using the katas to learn new languages. I always learn better when I have something to implement as I learn. Anyway, my original intention was to just link to the site, however it looks like the site hasn’t been updated in 4 years! It concerns me that it might not be around if I merely link to it, so I am going to copy the katas here for future reference. I hope Dave doesn’t mind.

# Kata 16 – Business Rules

How can you tame a wild (and changing) set of business rules?

Imagine you’re writing an order processing application for a large company. In the past, this company used a fairly random mixture of manual and ad-hoc automated business practices to handle orders; they now want to put all these various ways of hanadling orders together into one whole: your application. However, they (and their customers) have come to cherish the diversity of their business rules, and so they tell you that you’ll have to bring all these rules forward into the new system.

When you go in to meet the existing order entry folks, you discover that their business practices border on chaotic: no two product lines have the same set of processing rules. To make it worse, most of the rules aren’t written down: you’re often told something like "oh, Carol on the second floor handles that kind of order."

During first day of meetings, you’ve decided to focus on payments, and in particular on the processing required when a payment was received by the company. You come home, exhausted, with a legal pad full of rule snippets such as:

• If the payment is for a physical product, generate a packing slip for shipping.
• If the payment is for a book, create a duplicate packing slip for the royalty department.
• If the payment is for a membership, activate that membership.
• If the payment is an upgrade to a membership, apply the upgrade.
• If the payment is for a membership or upgrade, e-mail the owner and inform them of the activation/upgrade.
• If the payment is for the video "Learning to Ski," add a free "First Aid" video to the packing slip (the result of a court decision in 1997).
• If the payment is for a physical product or a book, generate a commission payment to the agent.
• and so on, and so on, for seven long, yellow pages.

And each day, to your horror, you gather more and more pages of these rules.

Now you’re faced with implementing this system. The rules are complicated, and fairly arbitrary. What’s more, you know that they’re going to change: once the system goes live, all sorts of special cases will come out of the woodwork.

##### Objectives

How can you tame these wild business rules? How can you build a system that will be flexible enough to handle both the complexity and the need for change? And how can you do it without condemming yourself to years and years of mindless support?

# Kata 17 -  More Business Processing

The rules that specify the overall processing of an order can be complex too, particularly as they often involve waiting around for things to happen.

In Kata Sixteen we had a look at the business rules that applied when we received payment for various kinds of product. Handling payments is just a small part of the overall workflow required to process an order. For the company whose application we’re looking at, order processing looks something like:

• If we accept an order over the web, then we have to wait for payment to arrive, unless it’s a credit-card order. In the case of credit card orders, we can process the order immediately and ship the goods, but only if the goods are in stock. If they are not currently in stock, we have to delay processing credit card orders until the become available again.
• We can receive a check in payment of an existing order, in which case we ship the goods (unless they are not in stock, in which case we hold the check until the goods become available).
• We can receive a purchase order (PO) for a new order (we only accept these from companies with an established credit account). In this case we ship the goods (assuming they are in stock) and also generate an invoice against the PO. At some point in the future we’ll receive payment against this invoice and the order will be complete.
• At any time before the goods are shipped the customer may cancel an order.

Each step in this process may occur many days after the previous step. For example, we may take an order over the web on Monday, receive a check for this order on Thursday, then ship the goods and deposit the check when the item is received from our own suppliers on the following Tuesday.

How can we design an application to support these kinds of rules? Of course, businesses change the rules all the time, so we better make sure that anything we come up makes it easy to update the workflow.

# Kata 18 – Transitive Dependencies

Let’s write some code that calculates how dependencies propagate between things such as classes in a program.

Highly coupled code is code where the dependencies between things are dense, lots of things depend on other things. This kind of program is hard to understand, tough to maintain, and tends to be fragile, breaking easily when things change.

There are many different kinds of coupling in code. One of the easiest to work with programatically is \emph{static coupling}, where we’re concerned with the relationships between chunks of source code. Simplisticly, we can say that class A is statically coupled to class B if the compiler needs the definition of B in order to compile

1. In many languages, static dependencies can be determined by source

code analysis. Tools such as makedepend (for C programs) and JDepend (for Java) look for explicit dependencies in the source and list them out.

One of the insidious things about dependencies is that they are transitive—if A depends on B and B depends on C, then A also depends on C. So, let’s write some code that calculates the full set of dependencies for a group of items. The code takes as input a set of lines where the first token is the name of an item. The remaining tokens are the names of things that this first item depends on. Given the following input, we know that A directly depends on B and C, B depends on C and E, and so on.

  A   B   C
B   C   E
C   G
D   A   F
E   F
F   H

The program should use this data to calculate the full set of dependencies. For example, looking at B, we see it directly depends on C and E. C in turn relies on G, E relies on F, and F relies on H. This means that B ultimately relies on C, E, F, G, and H. In fact, the full set of dependencies for the previous example is:

  A   B C E F G H
B   C E F G H
C   G
D   A B C E F G H
E   F H
F   H

Let’s express this using unit tests. The following code assumes that our dependency calculator is a class with an add_direct() method to add the direct dependencies for an item, and a dependencies_for() method to return the full list of dependencies for an item. The code uses Ruby’s %w{…} construct, which builds an array of strings from its argument.

  def test_basic
dep = Dependencies.new
dep.add_direct('A', %w{ B C } )
dep.add_direct('B', %w{ C E } )
dep.add_direct('D', %w{ A F } )

assert_equal( %w{ B C E F G H },   dep.dependencies_for('A'))
assert_equal( %w{ C E F G H },     dep.dependencies_for('B'))
assert_equal( %w{ G },             dep.dependencies_for('C'))
assert_equal( %w{ A B C E F G H }, dep.dependencies_for('D'))
assert_equal( %w{ F H },           dep.dependencies_for('E'))
assert_equal( %w{ H },             dep.dependencies_for('F'))
end

Stop reading now, and start coding. Once you’ve got your code working, try feeding it the following dependencies.

  A B
B C
C A

Does it work correctly? If not, ask yourself is this is a condition you should have considered during testing.

Once you’ve got your code working with all the various test cases you can imagine, let’s think for a minute about performance. Say we were using this code to find all the relationships between the inhabitants of the United Kingdom. How would your code perform with 55-60 million items?

# Kata 19 – Word Chains

Write a program that solves word-chain puzzles.

There’s a type of puzzle where the challenge is to build a chain of words, starting with one particular word and ending with another. Successive entries in the chain must all be real words, and each can differ from the previous word by just one letter. For example, you can get from "cat" to "dog" using the following chain.

  cat
cot
cog
dog

The objective of this kata is to write a program that accepts start and end words and, using words from thedictionary, builds a word chain between them. For added programming fun, return the shortest word chain that solves each puzzle. For example, my Powerbook running Ruby can turn "lead" into "gold" in four steps (lead, load, goad, gold), taking about 20 seconds. Turning "ruby" into "code" takes six steps (ruby, rubs, robs, rods, rode, code) and 90 seconds, while turning "code" into "ruby" (again in six steps) takes about an hour. Go figure…

Update: It turns out that my original algorithm was pretty dumb: a better approach greatly speeds up search and makes it symetrical. It now does all the above examples in between 0.5s and 1s.

# Kata 20 – Klondike

Experiment with different heuristics for playing the solitaire game Klondike.

The solitaire game Klondike is probably the most widely played in the world (particularly if you’re a Window’s user, where it has been available since Windows 3.1). It’s a simple game.

Cards are dealt face down onto the seven piles in the tableau. The first pile receives one card, the next two, up to the seventh pile, which not surprisingly has seven cards initially. The top card on each pile is turned face-up, and the undealt cards are placed on the Stock pile, all face down. Here’s a picture of the game (52kb) if you haven’t seen it before.

The idea is to build up four piles of cards in their suits on the foundation area (one pile for the clubs, one for the diamonds, and so on). The piles must start with the ace and end with the king.

The available moves (in no particular order) are:

• If the Stock becomes empty, turn the entire discard pile over and make it the new Stock.
• Turn over the top card of the Stock and place it face-up on the Discard pile.
• Move a card from the tableau or discard pile to one of the foundation piles. If the foundation pile is empty, only an Ace can be placed there, otherwise only the next highest card in the appropriate suit can be placed (so if a foundation pile is currently showing a four of hearts, only the five of hearts may be placed there).
• Move the top card of the discard pile to one of the tableau piles. This card must be one less in rank and opposite in color to the card at the top of the destination tableau.
• Move one or more cards from one tableau pile to another. If multiple cards are moved, they must be a sequence ascending in rank and alternating in color. The card moved (or the top of the sequence moved) must be one less in rank and opposite in color to the card at the top of the destination tableau. If the move leaves a face-down card to the top of the original pile, turn it over.
• If a move leaves a tableau pile empty, an exposed King at the top of a tableau or discard pile, or a sequence starting with a King on a tableau pile, may be moved to it.

So, in the game pictured about, a possible first set of moves might be:

1. Move the Queen of Hearts onto the King of Clubs.
2. This leaves the first pile in the tableau empty, so the combined King/Queen can be moved to it. The card originally beneath the King is turned over.
3. The Jack of Clubs can be moved on to the Queen of Hearts, and the card beneath the Jack revealed.

The game is won when all cards are moved to the foundation, and lost when the only remaining moves form an endless loop.

The game is simple to play, but the strategy isn’t immediately obvious. For example, is it always a good idea to move a card from the tableau to the foundation, or is it sometimes better to leave it there to give yourself something to build down on? Is it a good idea to make a move which leaves a tableau pile empty if you don’t immediately have a King to move into the gap? If you have two possible moves which will result in exposing a new tableau card, should you expose the one on the longest or shortest tableau?

This kata is in two parts.

1. Come up with an infrastructure so you can have the computer deal and play games of Klondike.
2. Use that infrastructure to experiment with strategies to see if you can increase the number of times you win (perhaps you could tabulate the number of times the machine wins a random set of 1,000 games for each strategy you try).

#### Objectives

• At one level, this is an exercise in design—how can the basic game be modeled in code? Where should the validation of moves be located (in the cards, in the various piles, in some kind of game overseer, or …)? How can we detect that we’ve lost?
• Once the basic game is in place, it becomes an exercise in imagination: what strategies can be applied, and how do various sub-strategies interact?

# Kata 21 – Simple Lists

Perhaps there’s more to the humble list of values than you might think. Let’s experiment with some basic list processing.

Lists are one of the first data structures we learn as programmers. But familiarity doesn’t mean that we can’t learn a little from them.

For this kata we’re going to code up three implementations of a list that has the following basic interface:

• The list consists of nodes. Each node has a string value, along with whatever housekeeping the list itself needs.
• New nodes are added to the end of the list.
• You can ask the list if it contains a given string. If it does, it returns the node containing that string.
• You can delete a node from the list.
• You can ask the list to return an array of all its values.

Here’s a basic set of unit tests to illustrate the behavior.

    list = List.new
assert_nil(list.find("fred"))
assert_equal("fred", list.find("fred").value())
assert_nil(list.find("wilma"))
assert_equal("fred",  list.find("fred").value())
assert_equal("wilma", list.find("wilma").value())
assert_equal(["fred", "wilma"], list.values())

list = List.new
assert_equal(["fred", "wilma", "betty", "barney"], list.values())
list.delete(list.find("wilma"))
assert_equal(["fred", "betty", "barney"], list.values())
list.delete(list.find("barney"))
assert_equal(["fred", "betty"], list.values())
list.delete(list.find("fred"))
assert_equal(["betty"], list.values())
list.delete(list.find("betty"))
assert_equal([], list.values())

For the kata, where the idea is to practice, let’s write three implementations of the list:

1. A singly linked list (each node has a reference to the next node).
2. A doubly linked list (each node has a reference to both the next and previous nodes).
3. Some other implementation of your choosing, except that it should use no references (pointers) to collect nodes together in the list.

Obviously, we won’t be using predefined library classes as our list implementations…

##### Objectives

There’s nothing magical or surprising in list implementations, but there are a fair number of boundary conditions. For example, when deleting from the singly-linked list, did you have to deal with the case of deleting the first element in the list specially?

For this kata, concentrate on ways of removing as many of these boundary conditions as possible. Then ask yourself: Is the resulting code, which will contain fewer special cases, easier to read and maintain? How easy was it to eliminate these special cases? Were there trade-offs, where removing a special case in one area complicated the code in another. Is there a sweet-spot when it comes to simplifying code?

Tags:

# Code Kata 11-15 of 21

5. September 2011 05:27

I came across this site quite a while back and have been meaning to blog about it. The site is CodeKata, and the premise is essentially to give reasonably interesting problems that can be solved using code in order to practice and hone your skills. Not everything is a coding problem, but where I can see these coming in handy are using the katas to learn new languages. I always learn better when I have something to implement as I learn. Anyway, my original intention was to just link to the site, however it looks like the site hasn’t been updated in 4 years! It concerns me that it might not be around if I merely link to it, so I am going to copy the katas here for future reference. I hope Dave doesn’t mind.

# Kata 11 – Sorting it Out

Just because we need to sort something doesn’t necessarily mean we need to use a conventional sorting algorithm.

We use sorting routines all the time; putting customer records in to name order, arranging orders by value (and even sorting the letters in a word back in KataSix). Most of the time we (wisely) use one of the sort routines built in to our language’s library (such as C’s qsort and Java’s java.Collections.sort). After all, very clever folks spent a lot of time getting these library routines tuned for speed and/or memory usage.

However, there are times when whipping up a sort of our own can outperform these generic routines. Our challenge this week is to implement a couple of different sorts. (However, at the risk of giving the game away, these sorts both have something in common).

##### Sorting Balls

In the Pragmatic Lottery (motto: There’s One Born Every Minute, and it Might Just Be You!), we select each week’s winning combination by drawing balls. There are sixty balls, numbered (not surprisingly, as we are programmers) 0 to 59. The balls are drawn by the personable, but somewhat distracted, Daisy Mae. As a result, some weeks five numbers are drawn, while other weeks seven, nine, or even fifteen balls make it to the winner’s rack. Regardless of the number of balls drawn, our viewers need to see the list of winning numbers in sorted order just as soon as possible. So, your challenge is to come up with some code that accepts each number as it is drawn and presents the sorted list of numbers so far. The tests might look something like:

    rack = Rack.new
assert_equal([], rack.balls)
assert_equal([ 20 ], rack.balls)
assert_equal([ 10, 20 ], rack.balls)
assert_equal([ 10, 20, 30 ], rack.balls)
##### Sorting Characters

Our resident conspiracy expert, Dr. X, is looking for hidden messages in the collected publications of Hugh Hefner. Dr. X believes the message is hidden in the individual letters, so rather than get distracted by the words, he’s asked us to write a program to take a block of text and return the letters it contains, sorted. Given the text:

  When not studying nuclear physics, Bambi likes to play
beach volleyball.

our program would return:

  aaaaabbbbcccdeeeeeghhhiiiiklllllllmnnnnooopprsssstttuuvwyyyy

The program ignores punctuation, and maps upper case to lower case.

Are there any ways to perform this sort cheaply, and without using built-in libraries?

# Kata 12 – Best Sellers

Consider the implementation of a top-ten best sellers list for a high volume web store.

A GedankenKata this week: no code needed (although writing short prototypes might help you come to a conclusion).

Say you’re writing code for an online site that sells things (something like Amazon). Your site is wildly popular, and you sell millions of items each day.

The marketing department wants the home page to display a top-ten list of the best selling items over the last 24 hours, with the list being updated each hour.

• How would you implement this?
• Are there any changes you could ask for to make the implementation easier?
• What would be the impact if they later came back and said:
• only update the list once per day; or
• we need the list updated in real time: each time the home page is displayed we need the list to reflect the 24 hours up until that point.

This kata might be deeper than it first appears. You might want to consider database vs. in-memory solutions, data structures that allow aging, time-space tradeoffs, and the like.

# Kata 13 – Counting Code Lines

Counting lines of code in Java source is not quite as simple as it seems.

This week let’s write something vaguely useful: a utility that counts
the number of lines of actual code in a Java source file. For the purpose
of this exercise, a line is counted if it contains something other than
whitespace or text in a comment. Some simple examples:

  -   // This file contains 3 lines of code
1   public interface Dave {
-     /**
-      * count the number of lines in a file
-      */
2     int countLines(File inFile); // not the real signature!
3   }

and…

  -   /*****
-   * This is a test program with 5 lines of code
-   *  \/* no nesting allowed!
-   //*****//***/// Slightly pathological comment ending...
-
1  public class Hello {
2      public static final void main(String [] args) { // gotta love Java
-          // Say hello
3        System./*wait*/out./*for*/println/*it*/("Hello/*");
4      }
-
5  }

Remember that Java comments are either "//" to the end of line, or "/*" to the next "*/". The block comments do not nest. There may be multiple /*…*/ comments on a line. Whitespace includes tabs, spaces, carriage returns, and vertical tabs. Oh, and remember that comment start sequences that appear inside Java strings should be ignored.

##### Goals of the Kata

The mixture of line-based things (single line comments, blank lines, and so on) with the stream-based block comments can make solutions slightly ugly. While coding your solution, consider the structure of your code, and see how well it fits the structure of the problem. As with most of these kata, consider coding multiple alternative implementations. Does what you learned on the first tries affect your approach to subsequent ones?

# Kata 14 – Tome Swift Under Milk Wood

Trigrams can be used to mutate text into new, surreal, forms. But what heuristics do we apply to get a reasonable result?

As a boy, one of my treats was go to the shops on a Saturday and spend part of my allowance on books; for a nine-year old, I had quite a collection of Tom Swift and Hardy Boys. Wouldn’t it be great to be able to create more and more of these classic books, to be able to generate a new Tom Swift adventure on demand?

OK, perhaps not. But that won’t stop us trying. I coded up a quick program to generate some swash-buckling scientific adventure on demand. It came up with:

…it was in the wind that was what he thought was his companion. I think would be a good one and accordingly the ship their situation improved. Slowly so slowly that it beat the band! You’d think no one was a low voice. "Don’t take any of the elements and the inventors of the little Frenchman in the enclosed car or cabin completely fitted up in front of the gas in the house and wringing her hands. "I’m sure they’ll fall!"
She looked up at them. He dug a mass of black vapor which it had refused to accept any. As for Mr. Swift as if it goes too high I’ll warn you and you can and swallow frequently. That will make the airship was shooting upward again and just before the raid wouldn’t have been instrumental in capturing the scoundrels right out of jail."

Stylistically, it’s Victor Appleton meets Dylan Thomas. Technically, it’s all done with trigrams.

Trigram analysis is very simple. Look at each set of three adjacent words in a document. Use the first two words of the set as a key, and remember the fact that the third word followed that key. Once you’ve finished, you know the list of individual words that can follow each two word sequence in the document. For example, given the input:

  I wish I may I wish I might

You might generate:

    "I wish" => ["I", "I"]
"wish I" => ["may", "might"]
"may I"  => ["wish"]
"I may"  => ["I"]

This says that the words "I wish" are twice followed by the word "I", the words "wish I" are followed once by "may" and once by "might" and so on.

To generate new text from this analysis, choose an arbitrary word pair as a starting point. Use these to look up a random next word (using the table above) and append this new word to the text so far. This now gives you a new word pair at the end of the text, so look up a potential next word based on these. Add this to the list, and so on. In the previous example, we could start with "I may". The only possible next word is "I", so now we have:

   I may I

The last two words are "may I", so the next word is "wish". We then look up "I wish", and find our choice is constrained to another "I".

   I may I wish I

Now we look up "wish I", and find we have a choice. Let’s choose "may".

   I may I wish I may

Now we’re back where we started from, with "I may." Following the same sequence, but choosing "might" this time, we get:

   I may I wish I may I wish I might

At this point we stop, as no sequence starts "I might."

Given a short input text, the algorithm isn’t too interesting. Feed it a book, however, and you give it more options, so the resulting output can be surprising.

For this kata, try implementing a trigram algorithm that generates a couple of hundred words of text using a book-sized file as input. Project Gutenberg is a good source of online books (Tom Swift and His Airship is here). Be warned that these files have DOS line endings (carriage return followed by newline).

#### Objectives

Kata’s are about trying something many times. In this one, what we’re experimenting with is not just the code, but the heuristics of processing the text. What do we do with punctuation? Paragraphs? Do we have to implement backtracking if we chose a next word that turns out to be a dead end?

# Kata 15 – Playing With Bits

At my son's karate school, they spend most of their time doing various exercises, with the occasional round of sparring thrown in. But every now and then the teacher finds a way to break the routine by injecting some kind of game or surprise into the mix. This kata is one of those. It's nothing serious, and unlikely to have practical benefits (unless you're working in some fairly specialized areas).

Think of binary numbers: sequences of 0's and 1's. How many n-digit binary numbers are there that don't have two adjacent 1 bits? For example, for three-digit numbers, five of the possible eight combinations meet the criteria: 000, 001, 010, 011, 100, 101, 110, 111. What is the number for sequences of length 4, 5, 10, n?

Having worked out the pattern, there's a second part to the question: can you prove why that relationship exists?

Tags:

# Code Kata 6-10 of 21

5. September 2011 05:23

I came across this site quite a while back and have been meaning to blog about it. The site is CodeKata, and the premise is essentially to give reasonably interesting problems that can be solved using code in order to practice and hone your skills. Not everything is a coding problem, but where I can see these coming in handy are using the katas to learn new languages. I always learn better when I have something to implement as I learn. Anyway, my original intention was to just link to the site, however it looks like the site hasn’t been updated in 4 years! It concerns me that it might not be around if I merely link to it, so I am going to copy the katas here for future reference. I hope Dave doesn’t mind.

# Kata 6 – Anagrams

Back to non-realistic coding this week (sorry, Martin). Let's solve some crossword puzzle clues.

In England, I used to waste hour upon hour doing newspaper crosswords. As crossword fans will know, English cryptic crosswords have a totally different feel to their American counterparts: most clues involve punning or word play, and there are lots of anagrams to work through. For example, a recent Guardiancrossword had:

  Down:
..
21. Most unusual form of arrest (6)

The hint is the phrase ‘form of,’ indicating that we’re looking for an anagram. Sure enough ‘arrest’ has six letters, and can be arranged nicely into ‘rarest,’ meaning ‘most unusual.’ (Other anagrams include raster, raters, Sartre, and starer)

A while back we had a thread on the Ruby mailing list about finding anagrams, and I’d like to resurrect it here. The challenge is fairly simple: given a file containing one word per line, print out all the combinations of words that are anagrams; each line in the output contains all the words from the input that are anagrams of each other. For example, your program might include in its output:

  kinship pinkish
enlist inlets listen silent
boaster boaters borates
fresher refresh
sinks skins
knits stink
rots sort

If you run this on the word list here you should find 2,530 sets of anagrams (a total of 5,680 words). Running on a larger dictionary (about 234k words) on my OSX box produces 15,048 sets of anagrams (including all-time favorites such as actaeonidae/donatiaceae, crepitant/pittancer, and (for those readers in Florida) electoral/recollate).

For added programming pleasure, find the longest words that are anagrams, and find the set of anagrams containing the most words (so "parsley players replays sparely" would not win, having only four words in the set).

#### Kata Objectives

Apart from having some fun with words, this kata should make you think somewhat about algorithms. The simplest algorithms to find all the anagram combinations may take inordinate amounts of time to do the job. Working though alternatives should help bring the time down by orders of magnitude. To give you a possible point of comparison, I hacked a solution together in 25 lines of Ruby. It runs on the word list from my web site in 1.5s on a 1GHz PPC. It’s also an interesting exercise in testing: can you write unit tests to verify that your code is working correctly before setting it to work on the full dictionary.

# Kata 7 – How’d I Do?

The last couple of kata have been programming challenges; let’s move back into mushier, people-oriented stuff this week.

This kata is about reading code critically—our own code. Here’s the challenge. Find a piece of code you wrote last year sometime. It should be a decent sized chunk, perhaps 500 to 1,000 lines long. Pick code which isn’t still fresh in your mind.

Now we need to do some acting. Read through this code three times. Each time through, pretend something different. Each time, jot down notes on the stuff you find.

• The first time through, pretend that the person who wrote this code is the best programmer you know. Look for all the examples of great code in the program.
• The second time through, pretend that the person who wrote this code is the worst programmer you know. Look for all the examples of horrible code and bad design.
• The third (and last) time though, pretend that you’ve been told that this code contains serious bugs, and that the client is going to sue you to bankruptcy unless you fix them. Look for every potential bug in the code.

Now look at the notes you made. What is the nature of the good stuff you found? Would you find similar good stuff in the code you’re writing today. What about the bad stuff; are similar pieces of code sneaking in to your current code too. And finally, did you find any bugs in the old code? If so, are any of them things that that you’d want to fix now that you’ve found them. Are any of them systematic errors that you might still be making today?

#### Moving Forward By Looking Back

Perhaps you’re not like me, but whenever I try this exercise I find things that pleasantly surprise me and things that make me cringe in embarrassment. I find the occasional serious bug (along with more frequent less but serious issues). So I try to make a point of looking back at my code fairly frequently.

However, doing this six months after you write code is not the best way of developing good software today. So the underlying challenge of this kata is this: how can we get into the habit of critically reviewing the code that we write, as we write it? And can we use the techniques of reading code with different expectations (good coder, bad coder, and bug hunt) when we’re reviewing our colleagues code?

# Kata 8 – Conflicting Objectives

Why do we write code? At one level, we’re trying to solve some particular problem, to add some kind of value to the world. But often there are also secondary objectives: the code has to solve the problem, and it also has to be fast, or easy to maintain, or extend, or whatever. So let’s look at that.

For this kata, we’re going to write a program to solve a simple problem, and we’re going to write it with three different sub-objectives. Our program is going do process the dictionary we used in previous kata, this time looking for all six letter words which are composed of two concatenated smaller words. For example:

  al + bums => albums
bar + ely => barely
be + foul => befoul
con + vex => convex
here + by => hereby
jig + saw => jigsaw
tail + or => tailor
we + aver => weaver

Write the program three times.

• The first time, make program as readable as you can make it.
• The second time, optimize the program to run fast fast as you can make it.
• The third time, write as extendible a program as you can.

Now look back at the three programs and think about how each of the three subobjectives interacts with the others. For example, does making the program as fast as possible make it more or less readable? Does it make easier to extend? Does making the program readable make it slower or faster, flexible or rigid? And does making it extendible make it more or less readable, slower or faster? Are any of these correlations stronger than others? What does this mean in terms of optimizations you may perform on the code you write?

# Kata 9 – Back to the Checkout

Back to the supermarket. This week, we’ll implement the code for a checkout system that handles pricing schemes such as "apples cost 50 cents, three apples cost $1.30." Way back in KataOne we thought about how to model the various options for supermarket pricing. We looked at things such as "three for a dollar," "$1.99 per pound," and "buy two, get one free."

This week, let’s implement the code for a supermarket checkout that calculates the total price of a number of items. In a normal supermarket, things are identified using Stock Keeping Units, or SKUs. In our store, we’ll use individual letters of the alphabet (A, B, C, and so on). Our goods are priced individually. In addition, some items are multipriced: buy n of them, and they’ll cost you y cents. For example, item ‘A’ might cost 50 cents individually, but this week we have a special offer: buy three ‘A’s and they’ll cost you $1.30. In fact this week’s prices are:  Item Unit Special Price Price -------------------------- A 50 3 for 130 B 30 2 for 45 C 20 D 15 Our checkout accepts items in any order, so that if we scan a B, an A, and another B, we’ll recognize the two B’s and price them at 45 (for a total price so far of 95). Because the pricing changes frequently, we need to be able to pass in a set of pricing rules each time we start handling a checkout transaction. The interface to the checkout should look like:  co = CheckOut.new(pricing_rules) co.scan(item) co.scan(item) : : price = co.total Here’s a set of unit tests for a Ruby implementation. The helper method price lets you specify a sequence of items using a string, calling the checkout’s scan method on each item in turn before finally returning the total price.  class TestPrice < Test::Unit::TestCase def price(goods) co = CheckOut.new(RULES) goods.split(//).each { |item| co.scan(item) } co.total end def test_totals assert_equal( 0, price("")) assert_equal( 50, price("A")) assert_equal( 80, price("AB")) assert_equal(115, price("CDBA")) assert_equal(100, price("AA")) assert_equal(130, price("AAA")) assert_equal(180, price("AAAA")) assert_equal(230, price("AAAAA")) assert_equal(260, price("AAAAAA")) assert_equal(160, price("AAAB")) assert_equal(175, price("AAABB")) assert_equal(190, price("AAABBD")) assert_equal(190, price("DABABA")) end def test_incremental co = CheckOut.new(RULES) assert_equal( 0, co.total) co.scan("A"); assert_equal( 50, co.total) co.scan("B"); assert_equal( 80, co.total) co.scan("A"); assert_equal(130, co.total) co.scan("A"); assert_equal(160, co.total) co.scan("B"); assert_equal(175, co.total) end end There are lots of ways of implementing this kind of algorithm; if you have time, experiment with several. #### Objectives of the Kata To some extent, this is just a fun little problem. But underneath the covers, it’s a stealth exercise in decoupling. The challenge description doesn’t mention the format of the pricing rules. How can these be specified in such a way that the checkout doesn’t know about particular items and their pricing strategies? How can we make the design flexible enough so that we can add new styles of pricing rule in the future? # Kata 10 – Hashes vs. Classes If we’re programming business applications in a language such as Java or C#, we’re used to constructing and using classes to manipulate our business objects. Is this always the right way to go, or would a less formal approach serves us well sometimes? This week’s topic doesn’t involve a coding challenge. Instead, we’re thinking about design and tradeoffs. Imagine that you’ve been asked to write an export utility for a large and complex database. The export has to read data from 30 or so tables (perhaps 100 columns are potentially written to each export record). Some of the exported data is written exactly as read from the database, but other exported data must be calculated. In addition, if certain flag fields have specific values, then additional data must be read from the database to complete an export row. The export data must obviously be correct, but the client is also asking for a flexible solution; their world changes a lot. One solution is to use existing business objects and existing persistence mechanisms, and to use higher-level classes to aggregate their results into a form that can be used to generate export rows. This higher level object could perform the calculations necessary for the virtual fields, and read in additional business objects if the flag fields dictate. An alternative solution might be to read the data row at a time into a Hash using ad-hoc queries, keying the hash on the field names. A separate pass could then be made to perform any necessary calculations, storing the results back in to the same hash. Additional data could be read from the database if the flag fields are set, again storing the results in the hash. The contents of the hash are then used to write the export record, and we loop back to do the next row. This kata is a thought experiment. What are the top three advantages and top three disadvantages of the two approaches? If you’re been using classes to hold data in your business applications, what would the impact be if you were to switch to hashes, and vice versa? Is this issue related to the static/dynamic typing debate? Tags: # Code Kata 1–5 of 21 5. September 2011 05:20 I came across this site quite a while back and have been meaning to blog about it. The site is CodeKata, and the premise is essentially to give reasonably interesting problems that can be solved using code in order to practice and hone your skills. Not everything is a coding problem, but where I can see these coming in handy are using the katas to learn new languages. I always learn better when I have something to implement as I learn. Anyway, my original intention was to just link to the site, however it looks like the site hasn’t been updated in 4 years! It concerns me that it might not be around if I merely link to it, so I am going to copy the katas here for future reference. I hope Dave doesn’t mind. # Kata 1 – Supermarket Pricing This kata arose from some discussions we’ve been having at the DFW Practioners meetings. The problem domain is something seemingly simple: pricing goods at supermarkets. Some things in supermarkets have simple prices: this can of beans costs$0.65. Other things have more complex prices. For example:

• three for a dollar (so what’s the price if I buy 4, or 5?)
• \$1.99/pound (so what does 4 ounces cost?)
• buy two, get one free (so does the third item have a price?)

This kata involves no coding. The exercise is to experiment with various models for representing money and prices that are flexible enough to deal with these (and other) pricing schemes, and at the same time are generally usable (at the checkout, for stock management, order entry, and so on). Spend time considering issues such as:

• does fractional money exist?
• when (if ever) does rounding take place?
• how do you keep an audit trail of pricing decisions (and do you need to)?
• are costs and prices the same class of thing?
• if a shelf of 100 cans is priced using "buy two, get one free", how do you value the stock?

This is an ideal shower-time kata, but be careful. Some of the problems are more subtle than they first appear. I suggest that it might take a couple of weeks worth of showers to exhaust the main alternatives.

#### Goal

The goal of this kata is to practice a looser style of experimental modelling. Look for as many different ways of handling the issues as possible. Consider the various tradeoffs of each. What techniques use best for exploring these models? For recording them? How can you validate a model is reasonable?

##### What’s a Code Kata?

As a group, software developers don’t practice enough. Most of our learning takes place on the job, which means that most of our mistakes get made there as well. Other creative professions practice: artists carry a sketchpad, musicians play technical pieces, poets constantly rewrite works. In karate, where the aim is to learn to spar or fight, most of a student’s time is spent learning and refining basic moves. The more formal of these exercises are called kata.

To help developers get the same benefits from practicing, we’re putting together a series of code kata: simple, artificial exercises which let us experiment and learn without the pressure of a production environment. Our suggestions for doing the kata are:

• find a place and time where you won’t be interrupted
• focus on the essential elements of the kata
• remember to look for feedback for every major decision
• if it helps, keep a journal of your progress
• have discussion groups with other developers, but try to have completed the kata first

There are no right or wrong answers in these kata: the benefit comes from the process, not from the result.

# Kata 2 – Karate Chop

A binary chop (sometimes called the more prosaic binary search) finds the position of value in a sorted array of values. It achieves some efficiency by halving the number of items under consideration each time it probes the values: in the first pass it determines whether the required value is in the top or the bottom half of the list of values. In the second pass in considers only this half, again dividing it in to two. It stops when it finds the value it is looking for, or when it runs out of array to search. Binary searches are a favorite of CS lecturers.

This Kata is straightforward. Implement a binary search routine (using the specification below) in the language and technique of your choice. Tomorrow, implement it again, using a totally different technique. Do the same the next day, until you have five totally unique implementations of a binary chop. (For example, one solution might be the traditional iterative approach, one might be recursive, one might use a functional style passing array slices around, and so on).

#### Goals

This Kata has three separate goals:

1. As you’re coding each algorithm, keep a note of the kinds of error you encounter. A binary search is a ripe breeding ground for "off by one" and fencepost errors. As you progress through the week, see if the frequency of these errors decreases (that is, do you learn from experience in one technique when it comes to coding with a different technique?).
2. What can you say about the relative merits of the various techniques you’ve chosen? Which is the most likely to make it in to production code? Which was the most fun to write? Which was the hardest to get working? And for all these questions, ask yourself "why?".
3. It’s fairly hard to come up with five unique approaches to a binary chop. How did you go about coming up with approaches four and five? What techniques did you use to fire those "off the wall" neurons?

#### Specification

Write a binary chop method that takes an integer search target and a sorted array of integers. It should return the integer index of the target in the array, or -1 if the target is not in the array. The signature will logically be:

   chop(int, array_of_int)  -> int

You can assume that the array has less than 100,000 elements. For the purposes of this Kata, time and memory performance are not issues (assuming the chop terminates before you get bored and kill it, and that you have enough RAM to run it).

##### Test Data

Here is the Test::Unit code I used when developing my methods. Feel free to add to it. The tests assume that array indices start at zero. You’ll probably have to do a couple of global search-and-replaces to make this compile in your language of choice (unless your enlightened choice happens to be Ruby).

  def test_chop
assert_equal(-1, chop(3, []))
assert_equal(-1, chop(3, [1]))
assert_equal(0,  chop(1, [1]))
#
assert_equal(0,  chop(1, [1, 3, 5]))
assert_equal(1,  chop(3, [1, 3, 5]))
assert_equal(2,  chop(5, [1, 3, 5]))
assert_equal(-1, chop(0, [1, 3, 5]))
assert_equal(-1, chop(2, [1, 3, 5]))
assert_equal(-1, chop(4, [1, 3, 5]))
assert_equal(-1, chop(6, [1, 3, 5]))
#
assert_equal(0,  chop(1, [1, 3, 5, 7]))
assert_equal(1,  chop(3, [1, 3, 5, 7]))
assert_equal(2,  chop(5, [1, 3, 5, 7]))
assert_equal(3,  chop(7, [1, 3, 5, 7]))
assert_equal(-1, chop(0, [1, 3, 5, 7]))
assert_equal(-1, chop(2, [1, 3, 5, 7]))
assert_equal(-1, chop(4, [1, 3, 5, 7]))
assert_equal(-1, chop(6, [1, 3, 5, 7]))
assert_equal(-1, chop(8, [1, 3, 5, 7]))
end


# Kata 3 – How Big, How Fast?

Rough estimation is a useful talent to possess. As you’re coding away, you may suddenly need to work out approximately how big a data structure will be, or how fast some loop will run. The faster you can do this, the less the coding flow will be disturbed.

So this is a simple kata: a series of questions, each asking for a rough answer. Try to work each out in your head. I’ll post my answers (and how I got them) in a week or so.

##### How Big?
• roughly how many binary digits (bit) are required for the unsigned representation of:
• 1,000
• 1,000,000
• 1,000,000,000
• 1,000,000,000,000
• 8,000,000,000,000
• My town has approximately 20,000 residences. How much space is required to store the names, addresses, and a phone number for all of these (if we store them as characters)?
• I’m storing 1,000,000 integers in a binary tree. Roughly how many nodes and levels can I expect the tree to have? Roughly how much space will it occupy on a 32-bit architecture?
##### How Fast?
• My copy of Meyer’s Object Oriented Software Construction has about 1,200 body pages. Assuming no flow control or protocol overhead, about how long would it take to send it over an async 56k baud modem line?
• My binary search algorithm takes about 4.5mS to search a 10,000 entry array, and about 6mS to search 100,000 elements. How long would I expect it to take to search 10,000,000 elements (assuming I have sufficient memory to prevent paging).

# Kata 4 – Data Munging

Martin Fowler gave me a hard time for KataTwo, complaining that it was yet another single-function, academic exercise. Which, or course, it was. So this week let’s mix things up a bit.

Here’s an exercise in three parts to do with real world data. Try hard not to read ahead—do each part in turn.

##### Part One: Weather Data

In weather.dat you’ll find daily weather data for Morristown, NJ for June 2002. Download this text file, then write a program to output the day number (column one) with the smallest temperature spread (the maximum temperature is the second column, the minimum the third column).

##### Part Two: Soccer League Table

The file football.dat contains the results from the English Premier League for 2001/2. The columns labeled ‘F’ and ‘A’ contain the total number of goals scored for and against each team in that season (so Arsenal scored 79 goals against opponents, and had 36 goals scored against them). Write a program to print the name of the team with the smallest difference in ‘for’ and ‘against’ goals.

##### Part Three: DRY Fusion

Take the two programs written previously and factor out as much common code as possible, leaving you with two smaller programs and some kind of shared functionality.

#### Kata Questions

• To what extent did the design decisions you made when writing the original programs make it easier or harder to factor out common code?
• Was the way you wrote the second program influenced by writing the first?
• Is factoring out as much common code as possible always a good thing? Did the readability of the programs suffer because of this requirement? How about the maintainability?

# Kata 5 – Bloom Filters

There are many circumstances where we need to find out if something is a member of a set, and many algorithms for doing it. If the set is small, you can use bitmaps. When they get larger, hashes are a useful technique. But when the sets get big, we start bumping in to limitations. Holding 250,000 words in memory for a spell checker might be too big an overhead if your target environment is a PDA or cell phone. Keeping a list of web-pages visited might be extravagant when you get up to tens of millions of pages. Fortunately, there’s a technique that can help.

Bloom filters are a 30-year-old statistical way of testing for membership in a set. They greatly reduce the amount of storage you need to represent the set, but at a price: they’ll sometimes report that something is in the set when it isn’t (but it’ll never do the opposite; if the filter says that the set doesn’t contain your object, you know that it doesn’t). And the nice thing is you can control the accuracy; the more memory you’re prepared to give the algorithm, the fewer false positives you get. I once wrote a spell checker for a PDP-11 which stored a dictionary of 80,000 words in 16kbytes, and I very rarely saw it let though an incorrect word. (Update: I must have mis-remembered these figures, because they are not in line with the theory. Unfortunately, I can no longer read the 8" floppies holding the source, so I can’t get the correct numbers. Let’s just say that I got a decent sized dictionary, along with the spell checker, all in under 64k.)

Bloom filters are very simple. Take a big array of bits, initially all zero. Then take the things you want to look up (in our case we’ll use a dictionary of words). Produce ‘n’ independent hash values for each word. Each hash is a number which is used to set the corresponding bit in the array of bits. Sometimes there’ll be clashes, where the bit will already be set from some other word. This doesn’t matter.

To check to see of a new word is already in the dictionary, perform the same hashes on it that you used to load the bitmap. Then check to see if each of the bits corresponding to these hash values is set. If any bit is not set, then you never loaded that word in, and you can reject it.

The Bloom filter reports a false positive when a set of hashes for a word all end up corresponding to bits that were set previously by other words. In practice this doesn’t happen too often as long as the bitmap isn’t too heavily loaded with one-bits (clearly if every bit is one, then it’ll give a false positive on every lookup). There’s a discussion of the math in Bloom filters at www.cs.wisc.edu/~cao/papers/summary-cache/node8.html.

So, this kata is fairly straightforward. Implement a Bloom filter based spell checker. You’ll need some kind of bitmap, some hash functions, and a simple way of reading in the dictionary and then the words to check. For the hash function, remember that you can always use something that generates a fairly long hash (such as MD5) and then take your smaller hash values by extracting sequences of bits from the result. On a Unix box you can find a list of words in /usr/dict/words (or possibly in /usr/share/dict/words). For others, I’ve put a word list up at pragprog.com/katadata/wordlist.txt.

Play with using different numbers of hashes, and with different bitmap sizes.

Part two of the exercise is optional. Try generating random 5-character words and feeding them in to your spell checker. For each word that it says it OK, look it up in the original dictionary. See how many false positives you get.

Tags:

# XML Config Transformations in Visual Studio and SlowCheetah

24. August 2011 22:13

Xml config transformations are an easy way to provide consistent changes to web.config files when deploying applications to different environments. However, until SlowCheetah came out this useful transformation was limited to actual deploys. SlowCheetah provides the ability to complete transformations when Run/Debugging in Visual Studio on any file and could come in quite handy. Between SlowCheetah and T4 templates, maybe someone can come up with a way to have Visual Studio just code the whole app for you!

Tags:

# PsExec

22. August 2011 14:33

Scott Hanselman recently posted an article in which he used PsExec to open regedit running as the System user (think administrator on steroids). This allowed him to delete registry entries that aren’t normally accessible. The cool thing about this tool is the remoting capabilities. I’m about to give it a shot here at work to see if it is quicker than using remote desktop for administering remote servers.

Utilities like Telnet and remote control programs like Symantec's PC Anywhere let you execute programs on remote systems, but they can be a pain to set up and require that you install client software on the remote systems that you wish to access. PsExec is a light-weight telnet-replacement that lets you execute processes on other systems, complete with full interactivity for console applications, without having to manually install client software. PsExec's most powerful uses include launching interactive command-prompts on remote systems and remote-enabling tools like IpConfig that otherwise do not have the ability to show information about remote systems.

Tags: