Comp200 lecture -- NP

Comp 200 Lecture
Hard Problems (NP)

Recap

Last time, before that long spring break:
We saw a very nice, complete picture for sorting: Mergesort requires n log(n) comparisons; AND we saw that ANY (comparison-based) sort requires 1/3 n log(n) comparisons. So nobody will ever do better than a factor of 3 over mergesort. [Note: "1/3 n log(n)" was a weak estimate; using Stirling's approximation we could have gotten that any sort requires n (log(n) - 1.5) comparisons.]
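To get a feel for how close these bounds sit to each other, here is a rough Python sketch that tabulates them for a few values of n. It assumes logs are base 2 and that the exact lower bound is log2(n!) (the usual decision-tree count: n! possible orderings, one bit learned per comparison); the function names are made up for illustration.

```python
import math

def log2_factorial(n: int) -> float:
    """log2(n!), computed via lgamma so it also works for large n."""
    return math.lgamma(n + 1) / math.log(2)

def compare_bounds(n: int) -> dict:
    """The bounds from the notes, side by side (all logs base 2)."""
    n_log_n = n * math.log2(n)
    return {
        "mergesort_upper": n_log_n,                   # n log(n)
        "exact_lower": log2_factorial(n),             # log2(n!), assumed decision-tree bound
        "stirling_lower": n * (math.log2(n) - 1.5),   # n (log(n) - 1.5)
        "weak_lower": n_log_n / 3,                    # 1/3 n log(n)
    }

for n in (16, 1024, 10**6):
    bounds = compare_bounds(n)
    print(n, {name: round(value) for name, value in bounds.items()})
```

As n grows, the Stirling-style lower bound creeps up toward n log(n), which is why "a factor of 3" is a pessimistic way to state the gap.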

And we saw other problems: "Well-behaved" jigsaw puzzles with n pieces can be solved with n² attempted pieces (though in practice you like to do even better). Compare that to the monkey-puzzle, where our only algorithm is not much more than trying-all-possibilities (in some organized way), which might require up to n! attempts, for all we know.
Are there other clever people who know a better way to do the monkey puzzle? Consider the fact that you buy (well-behaved) jigsaws with several thousand pieces, while monkey-puzzles come with about 15 pieces. That's evidence that other people don't know a better approach!
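A quick sketch of why 15 pieces is already plenty for a monkey-puzzle: compare n² attempts with n! attempts. The helper name `attempts` is just for illustration.

```python
import math

def attempts(n: int) -> tuple:
    """(jigsaw-style n^2 attempts, try-everything n! attempts) for n pieces."""
    return n ** 2, math.factorial(n)

for n in (5, 10, 15, 20):
    quadratic, factorial = attempts(n)
    print(f"n={n:2d}   n^2={quadratic:4d}   n!={factorial:,}")
```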

Looking for boundaries

The question becomes, for the monkey puzzle, the same question we had for sorting: Can we possibly do better, or is trying-all-possibilities about the best that can be done? (That is, can we have some argument to show that any algorithm requires proportional to n! steps?)
Answer: Nobody knows. (If you can find a better method, or if you can prove that all algorithms indeed require approx. n! steps, you'll be a famous computer scientist!)

Some other difficult problems:

Traveling Salesperson ("TSP"):
- Input: A list of cities, and the cost to travel between any pair.
  (You can go from any city directly to any other, though cost may be high.)
- Problem: Give a cheapest-cost tour which visits each city exactly once (finishing in the start-city.)
Discussion: Is this really a tough problem?
- What is running time of a naive algorithm which is sure to work?
- What about the simple, quick algorithm "go to nearest unvisited city"? This sounds reasonable, but I claim it doesn't always give the best answer. [Homework problem: find a counterexample: a map where that algorithm doesn't result in the cheapest tour!]
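Both algorithms from the discussion can be sketched in a few lines of Python. This is only an illustration: it assumes the costs arrive as a square table `cost[a][b]` with cities numbered 0..n-1, and the function names are invented here. The sample map is one where the two happen to agree, so it doesn't give away the homework.

```python
from itertools import permutations

def tour_cost(tour, cost):
    """Total cost of visiting the cities in order and returning to the start."""
    return sum(cost[a][b] for a, b in zip(tour, tour[1:] + tour[:1]))

def brute_force_tsp(cost):
    """Naive but sure to work: try every ordering after city 0, i.e. (n-1)! tours."""
    n = len(cost)
    best = min(((0,) + rest for rest in permutations(range(1, n))),
               key=lambda tour: tour_cost(tour, cost))
    return list(best), tour_cost(best, cost)

def nearest_neighbor_tsp(cost, start=0):
    """Quick heuristic: always go to the nearest unvisited city (about n^2 steps)."""
    tour = [start]
    unvisited = set(range(len(cost))) - {start}
    while unvisited:
        here = tour[-1]
        nxt = min(unvisited, key=lambda city: cost[here][city])
        tour.append(nxt)
        unvisited.remove(nxt)
    return tour, tour_cost(tour, cost)

# Four cities on the corners of a square; the diagonals cost 2.
square = [[0, 1, 2, 1],
          [1, 0, 1, 2],
          [2, 1, 0, 1],
          [1, 2, 1, 0]]
print(brute_force_tsp(square))
print(nearest_neighbor_tsp(square))
```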
Hamiltonian Path ("HamPath")
- Input: A list of cities, some of which are connected, some not.
  (Airline flights, perhaps.)
- Problem: Is there a tour which visits every city, w/o repeating (finishing in the start-city)?
  (e.g. you're a museum thief, and don't want to re-enter a city.)
Although this may not sound difficult, if i give you a map with 1000 cities, it can be difficult to find a hamiltonian path; if you're not careful you fly NYC to IAH to Rochester, and find that the only flight out of Rochester is back to NYC, even though you haven't toured everything. Avoiding such dead ends is difficult, because when you reach one, it's not clear which past decision led you wrong.
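The "try a flight, back up when you hit a dead end" strategy can be sketched as a backtracking search. This is a minimal illustration, not an efficient method: it assumes flights go both ways and are given as a dict from each city to the set of cities it connects to, and the function name and the extra city DEN are hypothetical.

```python
def find_hamiltonian_tour(flights, start):
    """Return a list of cities forming a tour from start, or None if there is none."""
    n = len(flights)
    path, visited = [start], {start}

    def extend():
        here = path[-1]
        if len(path) == n:
            # Every city visited once; we still need a flight back to the start.
            return start in flights[here]
        for nxt in sorted(flights[here]):
            if nxt not in visited:
                path.append(nxt)
                visited.add(nxt)
                if extend():
                    return True
                # Dead end somewhere further on: undo this choice, try the next.
                path.pop()
                visited.remove(nxt)
        return False

    return path if extend() else None

flights = {
    "NYC": {"IAH", "ROC", "DEN"},
    "IAH": {"NYC", "ROC", "DEN"},
    "ROC": {"NYC", "IAH"},
    "DEN": {"NYC", "IAH"},
}
print(find_hamiltonian_tour(flights, "NYC"))
```

In the worst case this still explores a huge number of partial paths, which is exactly the trouble described above.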

Reducing Problems

Is one of these problems (TSP, HamPath) easier than the other? Yes -- we'll show that:
HamPath <= TSP.
What do we mean when we write one problem as less than another?

"Solving HamPath is not more difficult than solving TSP", that is:
"HamPath can be reduced to TSP", that is:
"If i have a friend who can solve TSP, then i can set up a consulting agency to solve HamPath."

Okay, details: Exactly how do we reduce HamPath to TSP?
That is, I set up Ye Olde HamPath Solving Shoppe. Somebody brings me a tough instance of a HamPath problem; i scratch my head, but i remember that i can run out the back door to my friend's Ye Olde TSP Solving Shoppe.

Here's my algorithm: I take the map I was given (which just showed whether certain cities had direct flights to another, but didn't mention any costs), and i convert it to a TSP instance:

How can i add costs between all pairs of cities, so that a cheapest-cost tour on my modified map corresponds to whether or not there is an airline-tour that doesn't repeat a city?
Answer: For cities connected by airline flights, give them a cheap cost, $1; for cities NOT connected by airline, say there's a huge cost, say $100000 (an all-airline tour of n cities costs exactly $n, so any huge cost bigger than n will do). Now, turn this modified map over to the TSP solver, and find the cheapest tour. If the cheapest TSP tour costs less than $100000, i know that there is a way to tour the cities w/o repetition, just using airline routes. (And the TSP tour also doubles as a hamiltonian path.) But if no TSP tour costs less than $100000, then I know that there is no tour which avoids a non-airline-link.
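Here is the consulting agency as a small Python sketch. The friend's shop is stood in for by a brute-force solver (an assumption, since the point is that we don't care how the friend does it); flights are assumed to go both ways, the map is assumed to have at least 3 cities, and all the function names are invented for illustration.

```python
from itertools import permutations

CHEAP, HUGE = 1, 100000   # dollar costs from the notes; HUGE must exceed the number of cities

def hampath_to_tsp(cities, flights):
    """Convert the flight map into a full cost table: $1 for a flight, HUGE otherwise."""
    linked = {frozenset(pair) for pair in flights}
    return {a: {b: (CHEAP if frozenset((a, b)) in linked else HUGE)
                for b in cities if b != a}
            for a in cities}

def friends_tsp_shop(cities, cost):
    """Stand-in for the friend's TSP solver: brute force over all orderings."""
    first, rest = cities[0], cities[1:]
    best_tour, best_cost = None, float("inf")
    for perm in permutations(rest):
        tour = [first, *perm]
        total = sum(cost[a][b] for a, b in zip(tour, tour[1:] + tour[:1]))
        if total < best_cost:
            best_tour, best_cost = tour, total
    return best_tour, best_cost

def has_hamiltonian_tour(cities, flights):
    """Our shop: convert the instance, run out the back door, interpret the answer."""
    cost = hampath_to_tsp(cities, flights)
    tour, total = friends_tsp_shop(cities, cost)
    ok = total < HUGE
    return ok, (tour if ok else None)

print(has_hamiltonian_tour(list("ABCD"), [("A", "B"), ("B", "C"), ("C", "D"), ("D", "A")]))
print(has_hamiltonian_tour(list("ABC"), [("A", "B"), ("B", "C")]))
```

Note that the conversion itself is quick (it just fills in a table); all the hard work is subcontracted.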

Classifying Problems

Computer scientists see how some problems are feasible to solve (sorting, jigsaws), and others don't seem to be feasible: Monkey-puzzle, TSP, HamPath. Two broad categories:

P: the set of all problems solvable in polynomial time. This is arguably considered "efficient".
NP: the set of all problems for which a purported solution can be verified in polynomial time.
Example: A purported Monkey-puzzle solution can be easily checked for correctness (requires about 4n checks, to make sure the piece-colors really do match up), so Monkey-puzzle is in NP.
(Actually, you have to work to find a problem so difficult it's not in NP.)
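What "easily checked" looks like for the monkey puzzle, as a rough Python sketch. The representation is an assumption made for illustration: each tile is a (top, right, bottom, left) tuple of colors, the purported solution is a grid of already-rotated tiles, and the frame is ignored. The checker makes a constant number of comparisons per piece.

```python
from collections import Counter

def verify_monkey_solution(tiles, grid):
    """Check a purported solution: uses exactly the given tiles, and all touching edges match."""
    placed = [tile for row in grid for tile in row]
    if Counter(placed) != Counter(tiles):
        return False                       # wrong set of pieces
    rows, cols = len(grid), len(grid[0])
    for r in range(rows):
        for c in range(cols):
            top, right, bottom, left = grid[r][c]
            if c + 1 < cols and right != grid[r][c + 1][3]:
                return False               # mismatch with right-hand neighbor
            if r + 1 < rows and bottom != grid[r + 1][c][0]:
                return False               # mismatch with neighbor below
    return True

a = ("blue", "red", "blue", "green")
b = ("blue", "yellow", "blue", "red")
print(verify_monkey_solution([a, b], [[a, b]]))
print(verify_monkey_solution([a, b], [[b, a]]))
```

Checking is quick; it's finding the grid in the first place that seems to need something like n! attempts.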

Note that P is necessarily contained in NP (What does that mean, in terms of "...a problem efficiently solvable" and "...a problem with efficiently checkable solutions"? Why is the statement true?).

[Monday's lecture ended here.]

Monkeys vs Hams

What about the monkey puzzle: is that easier or harder than HamPath? If we're clever, we can indeed reduce HamPath to Monkey, meaning that HamPath is really no more difficult than Monkey is.

So what does "reduce HamPath to MonkeyPuzzle" mean, again? We want to go into business solving the HamPath problem, but we're assuming that we have a neighboring shop which solves the Monkey Puzzles, to whom we can subcontract out. That is, given a HamPath problem, let's create a monkey-puzzle which is somehow "equivalent" (a solution to one puzzle can be viewed as a solution to the other).

So, a customer gives us a HamPath map with n cities (say, A, B, C, K, T, W; n=6). We'll create a monkey puzzle which uses one color per city (think "chicago orange", "new york orange", "houston orange", etc.), plus the colors blue and pink. It will be a long skinny monkey puzzle: 1-by-n (or maybe 1-by-3n or such).

What monkey-tiles do we make? Ones with adjacent city colors on left and right, and blue on the top and bottom. That is, if A and W are connected, then we'll make a tile "AW" on the left/right (and blue on top/bottom). If A is also connected to C, then we'll also make a similar tile "AC". Notice -- a solved puzzle will then just be a sequence reading a HamPath from left to right, perhaps:

        AW, WC, CB, BK, KT, TA.

where A is adjacent to W, W to C, etc.
A number of problems to iron out:

There are more tiles than cities; we need to fit the extra tiles in somewhere. (E.g., the tile AC above, since we didn't happen to use the link from A to C in the hamiltonian circuit.)
- No prob; just have them at the end, buffered with tiles which have just a single-name on left and pink on right, and vice versa. So a solution really would look like
```
        AW WC CB BK KT TA;   PA AC CP,  PB BT TP, ...
```
where P stands for Pink. Note we can tell in advance, exactly how many of these pink tiles we need to create.
So the first few (pre-pink) tiles of the sequence correspond to the solution; we need to make sure that this solution really is n long:
- No prob; those pink tiles also have pink on top/bottom, and our frame has blue on the top/bottom at first, but starts having pink on the top/bottom at the n+1st tile. (What's the next frame position to have pink?)
We need to make sure each city is visited exactly once, starting and ending with (say) A. (That is, don't allow an n-long sequence like AB, BA, AC, CA, ....)
- Okay, we'll add more tiles, exactly one for each city, corresponding to "this city visited". The new tile has the same city color on the left and the right. We'll insert them in the path, so the tiles look like:
```
                AW, WW, WC, CC, CB, BB, ..., KT, TT, TA; (the pink residue).
```
We can make the top/bottom of these "city visited" tiles yet another color (fuchsia), to ensure they all sit in the first part of the monkey solution, and can never be put in the pink residue.

We take this monkey puzzle to our friend's shop, have them solve it, and when they hand back the solved problem we just read off the solution. Note that to be sure this scheme works correctly, we'd have to prove two things:

- A solution to the monkey puzzle really does correspond to a meaningful Hamiltonian Path in the original map,
- A Hamiltonian Path really does give rise to a legal solution to the monkey-puzzle we created.

Sidenote: The above reduction creates monkey-puzzles with large numbers of colors -- n+3 or so. Is the many-colored-monkey-puzzle problem harder or easier than a stricter version, where you are only allowed to use, say, five colors? How would you prove this? What is the minimum number of colors needed? [Answering these questions could be the bulk of your class-project.]

Turnabout is Fair Play

Thus, even though they seemed like unrelated problems, we have reduced HamPath to MonkeyPuzzle: any HamPath problem is really just a type of MonkeyPuzzle in disguise.

How about the other way? As it turns out, Yes!, we can actually reduce MonkeyPuzzle to HamPath.

This reduction is more difficult than the one just shown: We're given a monkey-puzzle, and we need to create a map, where a hamiltonian circuit corresponds to a solution to our original monkey-puzzle. It involves:

- Creating a city for each tile-in-a-certain-location. That is, there will be a city with the name "Tile-G at location 3,2", and another city with the name "Tile-G at location 4,7", etc. (How many cities? n² of them, so far!)
- Note that a hamiltonian path will go through all of the above cities; we need to concoct some way to make a path correspond to a legal monkey-solution. To that end, we'll make further cities which say "the city we visit next corresponds to a real tile location", and more which say "the city we visit next does not correspond to a real tile location".
- The hard part is connecting these cities in a clever way, so that any hamiltonian path ends up indicating that exactly one of the "Tile-G at location ..." cities is indicated as true, and all the other "Tile-G at location ..." cities are indicated as not true.

Again, it has to be argued that a HamPath corresponds to a legal solution to the corresponding monkey puzzle, and vice versa. [Working through the details of this would make a fine class project (I'd specify the reduction, but you show how it's correct; I'd help you with this.)]

Equally Difficult Problems (NPC)

Hey, what does it mean that HamPath and Monkey both reduce to each other? It means they're essentially equivalently difficult -- or put differently, the two problems are really just disguised versions of each other! As it turns out, nobody's been able to find a good algorithm for solving either.

Hold on to your seat, we're about to get more abstract:
Fact: Any problem with efficient-checking-of-proposed-solution can be reduced to MonkeyPuzzle!
Wow, this is a very strong statement: any problem with that property, even problems which haven't been thought up yet. (This was proved by Steve Cook, 1971. Possible project: understand this result. Quick insight: use monkey-tiles to simulate Scheme expressions, with tiles like "apply function" and "the placeholder sum"; the bottom row of the monkey puzzle is the initial problem, and every row above it, in a valid monkey puzzle, would correspond to a single step of the stepper.)

We say that Monkey-puzzle is NP-complete ("NPC"), since any problem in NP reduces to it. (It's as tough as any problem in NP.) Note that finding an efficient solution to MonkeyPuzzle means finding an efficient solution to any problem in NP!

What happens when you put the above fact together with MonkeyPuzzle <= HamPath? Yes -- Any problem with efficient-checking-of-proposed-solution can also be reduced to HamPath. (The notation "<=" is suggestive of this.) There are many NP-complete problems -- including these two and TSP; finding a way to efficiently solve any of these problems will mean an efficient way to solve all of them!
[Ian: Pass around Garey&Johnson, with its 90 pages of terse problem descriptions.]

P vs NP

Is P = NP? That is, is finding a solution really no harder than checking a proposed solution? We can't prove it's more difficult (or that it's not more difficult)! Cf. the Simpsons Halloween episode where Homer enters a 3-D world; one of the floating statements is "P = NP", and another is a purported counterexample to Fermat's Last Theorem.

It is not known whether P = NP -- that is, whether every problem whose solutions are easily checkable is also easy to solve. (Seems like it should be harder to solve a problem than to just verify a proposed solution, but we can't prove this.) A very embarrassing lack of knowledge.

Note that there is one big difference between Monkey-Puzzle, HamPath, etc, and our discussion of insert-sort, merge-sort, etc.: Here we are talking about the problem, and not a particular algorithm for the problem!

Beyond NP

You might be wondering, what is a problem which isn't in NP? Can't all purported answers always be efficiently checked? As it turns out, no:

Primes: A problem which is not obviously in NP. If i claim that a number is prime, can you efficiently verify my solution? Not obviously: I give you the number 7654321, and you need to do about 7654321 divisions to determine whether the number is prime. Not an efficient algorithm.
[If you're clever, you don't need to do that many divisions; after checking for division-by-2, you don't need to try any other even numbers, cutting your work in half. There are several more things you can do, to find the answer in even fewer divisions, maybe much fewer than n, but with trial division it will still be much more than polynomial-in-the-number-of-digits-of-n. Doing it in polynomially many steps was long a major open problem in mathematics; it was settled by Agrawal, Kayal, and Saxena, announced in 2002.]
You might think "Gee, given a number n, I find the answer in fewer than n steps -- that seems efficient!" But I'll claim the size of the problem i handed you is really the number-of-digits of n. Think of it as: I can, in a mere 30 keystrokes, give you a problem that takes about 10³⁰ steps for you to solve!

If i claim a number can be factored (is non-prime, or "composite"), then you can quickly verify my purported factoring. [Note that 1 is considered neither prime nor composite, just like 0 is considered neither positive nor negative.] That is, you can efficiently verify claims of "no-answer"; this means that Primes is in a class called "co-NP".
However: Pratt, 1975, showed that primes do have short certificates(!) (Non-obvious.)
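The contrast between checking "is prime" by trial division and checking a claimed factoring can be sketched in Python. This is only an illustration with made-up function names; stopping at the square root of n is one of the "several more things you can do", assumed here rather than taken from the notes.

```python
def is_prime_by_trial_division(n):
    """Return (is_prime, divisions_tried): try 2, then odd numbers up to sqrt(n)."""
    if n < 2:
        return False, 0
    tried = 0
    d = 2
    while d * d <= n:
        tried += 1
        if n % d == 0:
            return False, tried
        d += 1 if d == 2 else 2   # after 2, skip the even numbers
    return True, tried

def verify_factoring(n, p, q):
    """Checking a claimed factoring n = p * q takes one multiplication."""
    return 1 < p < n and 1 < q < n and p * q == n

print(is_prime_by_trial_division(97))
print(is_prime_by_trial_division(7654321))
print(verify_factoring(7654321, 19, 402859))
```

Even with the square-root shortcut, a prime with d digits needs on the order of 10^(d/2) divisions: still exponential in the number of keystrokes it takes to type the number.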