The other day I was talking to my friends about package managers, particularly how stack, one of Haskell's package managers, and cargo, Rust's package manager, choose which packages to install when building a project, and the tradeoffs involved. What I loved about this conversation was being able to touch on some of the CS (Computer Science) theory (hey wow, I finally used my degree for something) behind choosing a package, as well as what that means for the programmer. I figured it'd be fun to do a bit of a deep dive into how package managers work at a theoretical level, what happens when theory meets implementation, and what those tradeoffs are, demystifying something many of us depend on day to day. With that, let's dig into some discrete mathematics (don't worry, I'll make sure your eyes won't glaze over) and then cover some real world package managers and how they solve the problem of which packages to install.
SAT Problem (No, not the college one)
The CS problem underlying every package manager's choice of which package versions to use is known as the SAT Problem, or Satisfiability Problem. It asks: given some boolean proposition, can we find values for its variables that make it evaluate to true? Let's take a look at two quick examples, "A and B" as well as "A and not A".
A and B can each be either true or false. If we plug these into our proposition "A and B" we get a set of assignments: "true and true," "true and false," "false and true," and finally "false and false". Only one assignment satisfies the proposition, "true and true". This means our proposition is considered satisfiable! What about our other proposition?
A can either be true or false. If we plug this into the proposition “A and not A” we get “true and not true” and “false and not false”. We have no way to satisfy our proposition and so this is considered unsatisfiable!
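The two examples above can be checked mechanically. Here's a minimal brute-force SAT checker in Python (the `satisfiable` helper is my own illustrative name, not a real library's API): try every true/false assignment and see whether any makes the formula true.

```python
from itertools import product

def satisfiable(formula, variables):
    """Brute-force SAT check: try every true/false assignment."""
    for values in product([True, False], repeat=len(variables)):
        assignment = dict(zip(variables, values))
        if formula(assignment):
            return True
    return False

# "A and B" is satisfiable: the assignment A=True, B=True works.
print(satisfiable(lambda v: v["A"] and v["B"], ["A", "B"]))   # True

# "A and not A" is unsatisfiable: no assignment works.
print(satisfiable(lambda v: v["A"] and not v["A"], ["A"]))    # False
```

Note that the loop walks all 2^n assignments for n variables, which is exactly why this gets expensive as propositions pile up.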
Okay, not bad, but what happens as we add more and more propositions? It gets much harder to figure out. The SAT Problem is known as an NP-Complete problem. In computational complexity theory we have P and NP problems. P problems can be solved in polynomial time: things with an upper bound like the O(n log n) solutions for sorting. NP problems can be solved in Nondeterministic Polynomial time: generally speaking, we have to check all the possible solutions to be sure the answer we choose is the correct one. An example is the famous Traveling Salesman problem, where we have to examine all the possible routes to determine which one is the shortest. Now, there are ways to simplify NP problems, or be smart about how we program them to save computation time, but at the end of the day the problem as a class is still NP. A simple way to look at this: P problems are fast to solve and fast to verify, whereas NP problems are hard to solve but fast to verify. A lot of thought has gone into whether every NP problem can be solved as a P problem (P = NP), because then we could solve NP problems in polynomial time. That would let us solve a lot of important problems quickly, but it would also break things that depend on certain problems being hard, like factoring large numbers into primes, which underpins how encryption keys are created.
I'm really oversimplifying for the sake of this article, and I'm sure I've got some inaccuracies here, so if you want to know more about P and NP problems I'd suggest looking into them further. It's really interesting! What I do want to convey is that we may have to try many, if not all, sets of possible package versions to find one that works, making version selection an NP problem. Lots of tricks can make it run faster in practice, like caching selections that work for certain constraints for later use. However, speeding up how a package manager selects versions is not in scope for this article. We want to look at whether we can select a set of versions at all, and what happens then.
A digression on semver and its conventions
For the sake of this article we'll assume any packages talked about follow semver (semantic versioning) of the form x.y.z, where a bump to x is a major breaking change (changing a public interface), a bump to y is a minor change (usually adding a public interface), and a bump to z is a patch (fixing a bug inside a function, for example). Semver only specifies how to write out a versioning scheme; what the numbers mean is a bit fuzzy in programmers' minds and defined more by convention than anything else. The conventions in this paragraph are what I mean by semver.
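To make the convention concrete, here's a small Python sketch of comparing x.y.z version strings. The `parse` and `compatible` helpers are illustrative names of my own, not any real package manager's API:

```python
def parse(version):
    """Split an 'x.y.z' string into a comparable (major, minor, patch) tuple."""
    major, minor, patch = (int(part) for part in version.split("."))
    return (major, minor, patch)

# Tuples compare left to right, which matches semver precedence.
assert parse("1.4.5") < parse("1.5.0")
assert parse("2.0.0") > parse("1.9.9")

def compatible(installed, required):
    """Rough 'same major version, at least as new' check, matching the
    convention above that a major bump may break the public interface."""
    return installed[0] == required[0] and installed >= required
```

Under that rule, 1.6.5 can stand in for 1.6.0, but 2.0.0 cannot stand in for anything in the 1.x line.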
Picking packages is hard
When programming we pick packages based on a version (1.4.5, 1.9, 2.8, etc.) and we then have access to those libraries in our own code. This is great! We don't have to rewrite the same functions over and over for every program we write. The downside is that we need to select a set of packages that satisfy (see, told you the theory was relevant!) the versioning constraints we supply. Let's look at a simplified example so you can see what I mean.
Suppose we choose to use packages A and D in our code. Both A and D depend on packages B and C themselves. Package A wants to use a version of package B that is greater than or equal to version 1.0.0 and less than or equal to version 1.5.0. Package D wants to use a version of package B that is exactly 1.4.5. Package A also wants to use a version of package C that is exactly version 1.6.5 and Package D wants to use a version of package C that is greater than version 1.6.0.
What we want to do is find a version for all 4 packages that satisfies the constraints given, choosing exactly one version per package. For A and D we the programmers specified the versions we want to use; in this example we chose 1.9.5 and 1.7.8 respectively. Great, we have two package versions out of 4 already! Now what about B? A will use B if the version is between 1.0.0 and 1.5.0 inclusive, and D says it'll only use 1.4.5. The good news is 1.4.5 falls within the bounds set by A! This means our version for B is 1.4.5! With package C, D says it'll use any version greater than 1.6.0, and A will only use C if it's exactly 1.6.5. Since 1.6.5 is greater than 1.6.0 we choose that version!
We did it! We have 4 package versions!
- A: 1.9.5
- B: 1.4.5
- C: 1.6.5
- D: 1.7.8
Nice, we solved our SAT Problem by finding a set of package versions that fit our code. Now, what happens if D says it'll only use version 1.6.0 of dependency C? Oh no, we have a problem. A requires 1.6.5, D requires 1.6.0, and we have no way to choose just one version of C! We can't use A and D together in our program now. And this was with only two direct dependencies! What if we have 30 or 40 dependencies we want to use? How would we choose the right versions? Could we even choose the right versions?
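The walkthrough above, including the conflicting variant, can be expressed as a tiny brute-force solver. This is a toy Python sketch with made-up version lists for B and C, not how any real package manager is implemented:

```python
from itertools import product

# Available versions of each transitive dependency (made up for this example).
versions = {
    "B": ["1.0.0", "1.4.5", "1.5.0"],
    "C": ["1.6.0", "1.6.5", "1.7.0"],
}

def v(s):
    """Turn '1.4.5' into (1, 4, 5) so versions compare numerically."""
    return tuple(int(x) for x in s.split("."))

# The constraints from the example above.
constraints = [
    lambda pick: v("1.0.0") <= v(pick["B"]) <= v("1.5.0"),  # A's bound on B
    lambda pick: pick["B"] == "1.4.5",                      # D's pin on B
    lambda pick: pick["C"] == "1.6.5",                      # A's pin on C
    lambda pick: v(pick["C"]) > v("1.6.0"),                 # D's bound on C
]

def solve(constraints):
    """Try every combination of versions until one satisfies every constraint."""
    for b, c in product(versions["B"], versions["C"]):
        pick = {"B": b, "C": c}
        if all(check(pick) for check in constraints):
            return pick
    return None  # unsatisfiable

print(solve(constraints))  # {'B': '1.4.5', 'C': '1.6.5'}

# Change D to pin C to exactly 1.6.0 and the problem becomes unsatisfiable:
conflicting = constraints[:3] + [lambda pick: pick["C"] == "1.6.0"]
print(solve(conflicting))  # None
```

Notice the shape: it's the same try-every-assignment loop as the SAT checker earlier, just with version numbers instead of booleans.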
Now this is where theory hits implementation. We'll look at how stack and cargo each solve the fundamental question of a package manager: "what package versions should I download to make this project work?"
Satisfying Theory vs Bending Theory
Let's start with stack. Its solution to the problem is "we'll maintain an LTS set of packages tied to compiler versions." Haskell really has two problems: minor version bumps in GHC can break packages, and your program will only compile if its dependencies satisfy the SAT problem. Rather than have you, the programmer, figure out which versions of the packages work together (which is the approach of cabal, the first package manager Haskell had, and its approach still), stack solves the problem by curating an LTS: a set of packages that all work together. As long as packages A and D are in the LTS, you can use them without worrying that your program will fail to compile. This is great! We solve the SAT problem for packages, we get code that compiles, and it's tied to a compiler version so we know it all should work together!
Unfortunately, there are some flaws with this approach. What if your package is not in the LTS, or was in a previous version and got dropped? Now you can't upgrade, or you'll have to add it as an extra dependency in the stack manifest and hope it works. If you can, you get the maintainer to upgrade it to work with the new LTS. Worst case? You fork the code yourself, patch it, and make the package manager depend on your git fork. Less than ideal, and something I ran into when I wrote Haskell professionally. Ask me about it sometime; it's a cursed story. Other problems include the increased burden on a few maintainers to make sure an LTS works together, as well as dealing with outdated packages, because OSS runs on free labor and we can't expect everything to be upgraded all the time.
stack satisfies the problem with a high-maintenance solution, but when it works, it works well, and it can produce a smaller binary (one copy of each package shared amongst all your dependencies), which, as we'll see next, is not necessarily the case with cargo.

If stack attempts to solve the problem by maintaining a list of known good packages, then what does cargo do? It just says, "I'll compile both versions in if I have to." In our example
cargo would compile in both version 1.6.0 and version 1.6.5 of package C. (Strictly speaking, cargo only keeps duplicates when the requirements are semver-incompatible, such as 1.x versus 2.x; two exact pins within the same major version will still fail to resolve. The principle stands, though.) There are quite a few advantages to this: your program doesn't have to worry about using a newer version of a popular package, because even if another one of your dependencies uses that same package at an older version, it's okay! While cargo tries to find a single version per dependency, it knows that's not always possible, so it may compile in different versions of the same package. There's also less burden in maintaining a package registry: we only need to provide the packages, not the packages plus a list of what works well together. The result of all these advantages is that upgrading your dependencies is relatively painless, beyond the usual issues of breaking changes between versions.
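Here's a toy Python sketch of that duplication strategy. It is not cargo's actual resolver; it just assumes one copy of a package is kept per major version, which is roughly the real rule for when cargo allows duplicates:

```python
def resolve_duplicating(requirements):
    """Group requirements by (package, major version). Each group gets its
    own compiled-in copy, so semver-incompatible requirements coexist
    instead of conflicting."""
    chosen = {}
    for package, version in requirements:
        major = int(version.split(".")[0])
        slot = (package, major)
        # Within a compatible (same-major) range, keep the newest request.
        current = chosen.get(slot)
        if current is None or tuple(map(int, version.split("."))) > tuple(
            map(int, current.split("."))
        ):
            chosen[slot] = version
    return chosen

# A needs C 1.6.5 while another dependency needs C 2.0.0:
# both get compiled in, each under its own slot.
reqs = [("C", "1.6.5"), ("C", "2.0.0")]
print(resolve_duplicating(reqs))  # {('C', 1): '1.6.5', ('C', 2): '2.0.0'}
```

The tradeoff is visible right in the return value: two slots means compiling and linking the library twice.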
There must be some downside though, right? Longer compile times, and a possibly bigger binary due to having two copies of the same library linked in. There's also the possibility of a conflict with packages that export a type as part of their public interface. If packages A and D each expose a type from C in their own interfaces, but from different versions of C, which one should the compiler use? The two versions could behave differently. In these cases the compiler will fail to build your program. However, I can count the number of times this has happened to me over 4 years on my fingers, so overall it's a fairly decent strategy with some unfortunate and unsolvable edge cases.
Which solution is better? Personally I'm a fan of the two-package solution. Developer time is not cheap, disk space is, and I'd rather build an extra dependency and put more on disk than lose an afternoon of productivity. This is my personal opinion though. You can weigh the pros and cons yourself, and I'd recommend looking into it more and forming an informed opinion.
Packaging it all up
The goal of this article was to teach you a little bit of theory and show some practical real world applications of it, as well as things that come up when theory meets the need for an implementation. There's more to this topic than I covered, such as making the choice of which versions to use fast. There are all kinds of ways to make SAT problems run faster in practice than the worst-case theory suggests (the problem is NP-Complete in the worst case), and I recommend taking the time to look them up if you're interested. I wanted to give you an understanding of what's going on when you hit build and what your tool is doing for you: a lot of work to make sure all your code works without you spending hours figuring out how to get the pieces working together.