Interesting Use of Number Theory in URL Shortener
When You Consider Infinity, The Solution Is Very Practical
Assigning a Unique Number to a Unique String:
There are 95 printable ASCII characters. That means every word, sentence, paragraph, essay, poem, book, legal document, constitutions, everything ever written in modern English can be represented by combinations of these 95 symbols.
And here’s another point, combinations of these 95 symbols will hold all the literature that will ever be written in English in the future. That includes an infinite amount of gibberish that has no meaning too.
Here’s an interesting contradiction. The combinations are infinite, but they are countable. Yes, infinity can be countable.
Let’s examine this using a bit of computer science. Those 95 characters are what strings are made of. At present, computers cannot deal with numbers any larger than 264. So, assuming a string element taking up single byte, it can hold at most a string of length 264, although practical limit will be much much lower.
Think of the series below:
195 + 295 + 395 + 495 + 595 + ……….. + (264)95
The above sum is the total number of strings theoretically representable in a computer. The sum itself is a number which is much larger than 264, which is the largest number a computer can do calculations with. It’s a series with a HUGE number..But let’s get some sense of scale here.
Consider the last number of that series. It is very important, because it will contain every other strings as substrings. The last number here is (264)95 or 26080. Most web calculators, including Google’s do not calculate any value larger than 21023. It is not because Google cannot calculate it, but it is an arbitrary limit honoring standardized rules.
21023 is equal to about 8.9e307. The number 6080 is 6 roughly times bigger than 1023, so, 26080 will be about 496981(8.9^6) multiplied by 10^1842(6 * 307). The number is going to be very roughly around 4.9e1848. The number will take up about 6*1024 bits or 6144 bits. How large is that number? It is larger than the number Googol(10100). The total number of particles in the universe is only 1080.
So, yes, we are dealing with pretty big numbers here. But the point of the above exercise is that, the number is finite and countable. And, that means we can assign a unique number to each of the all possible strings.
Now, we can apply that to something more practical. We know, there are many many URLs out there. And every URL is unique. To assign a unique number to them, we can’t count each one of and assign a number to them incrementally. Say, you are creating a URL shortener. You’ll be given a URL and you have to shorten it. But you have to keep track of the original URL with an ID.
A practical solution would be to calculate a unique number for that specific URL, and then create a string hash from it.
What is the Hashing Technique?
URLs contain printable characters only. Although the number of possible URL is consistently on the rise, it is still far below the 264 level. In fact, we can very well use 48 bits and still have room for all the past and present URLs and URLs that will be generated in foreseeable future.
Every symbol in the URL has an ASCII value. Initialize the hash number to zero, and then multiply the hash number with a prime number and add to it the ASCII number of the character and keep doing it over the same hash number variable for the whole URL string.
hash = 0
prime = 31
foreach char in URL:
hash = hash * prime + int(char)
return hash
Here int(char) gives the ASCII value of the character.
Since, every URL has a unique sequence of character, this process guarantees a unique number for each URL.
How To Get the String Out of the Hash Number?
Now that we have a unique number representing a URL, let’s try to assign a unique smaller string to represent the number itself.
If you remember, we are using 48 bits to represent our number. We are quite familiar with binary, octal, decimal and hexadecimal number systems. Now, we are going to use a Base64 number system.
The Base64 number system needs 64 symbols to represent its digits. Each symbols uniquely represent a number between 0 to 63. English has 26 letters. When you consider capital and small letters, we get 52 symbols. Add to the digits 0 to 9, we get 62 symbols. Now, for the remaining two symbols we can use % and = signs. Randomize this symbols and consider a symbol’s serial position(on a 0 based index) as the number it represents.
The number 64 needs 6 bits to represent it. So, we consider 6 bits a chunk of the 48 bit number separately, and assign a symbol according to the number each 6 bits represent. There will 8 chunks of 6 bits, so the string representation will be 8 character long.
And this is how you can represent a URL with shortened fix length string in a URL shortener. So you see from thinking of the infinite we have deduced a very specific solution for an everyday problem.