On IP address based geolocation, bgp routing, and autonomous systems
Plug "geoip" into google and you will find a smorgasbord of sites offering ip based country determination. How exactly does something like this work? Let's take a peek into the innards of the internet.
Geolocation based ip address filtering can feel like a bit of a black box. How does a public ip indicate where in the world you are located? Until you realize… there are private organizations (IANA, IETF, RIRs, NIRs, LIRs) tasked with allocating ipv4 and ipv6 addresses to other organizations, and the country value reported by the IANA, the RIRs and so on is simply imputed from the country of registration of the org assigned the range. Companies, organizations and departments, whether government or private sector, run networks after all, not countries.
In hindsight, this seems fairly obvious. It couldn't have been any other way: we need each allocated ip address to be unique, which means we end up with a central allocation mechanism in some form or another.
The IANA/IETF manage the global ip pool, all the way from 0/8 to 255/8. Each of these ranges is allocated to either an RIR or an autonomous network/system. An AS is just a large network of routers running external bgp that operates a common routing policy distinct from its upstream networks and is willing to forward traffic intended for other networks, probably with a couple of peering agreements set up. Such ASes are usually either large end user orgs (Microsoft, Apple, the US DoD), or large orgs that further sub-allocate to their users (think an international ISP, telecom provider or cloud services provider that resells access to its network, an LIR, an NIR).
Each of these RIRs/NIRs/LIRs then repeats the same process: either allocate ips directly to an end user, like an autonomous system run by an organization, or to another organization that will further sub-allocate. Luckily for us, the IANA and the RIRs all publish, as a matter of policy, public files listing these network ranges and the country of assignment.
The country field you see in these public files is basically a proxy for the country of registration of the NIR/LIR/AS that received the range; it says nothing about the actual real world geographic location of the ip. If an end user is allocated ips directly from an RIR/NIR/LIR, they have a provider independent range, in the sense that they can use these ips with any isp servicing the geographical region their resource allocator manages. If, on the other hand, you receive these ranges from an isp, then you are restricted to using them in areas that the isp services.
Which means that if you set up servers in an area serviced by a different internet provider, you won't be able to use the previously allocated static public ip. So, both technically and practically, spatially dispersed orgs holding provider independent ips can end up using those ips in a country different from the country of registration of the organization originally assigned the ranges, as long as the ip is used in a region served by their RIR/NIR/LIR.
We are okay with blocking legitimate traffic from blacklisted countries, largely because whatever service we are running is only of value to whitelisted countries, and we are not looking to market in blacklisted countries. This works for services that are spatially localized for whatever reason. Could be as simple as not having the supply chain to serve a customer in a blacklisted country. Could be you built a price lookup and comparison service but don't have the data for blacklisted countries. Ad infinitum. The tradeoff of improved server performance, lower latency and lower computational costs versus a reduced likelihood of business from marketing to fewer eyes might be worth it to certain types of websites/businesses. Bird in hand vs two hundred in a faraway bush, I guess.
Such sweeping rules make all the more sense when you are unsure about your cyber security, already have more than enough legitimate traffic originating from the countries of interest, or the downside to a hack is very high.
If we think of the firewall as a packet classifier, we are okay with a high false positive rate (legitimate traffic from blacklisted countries getting blocked) as long as we have a very high true positive rate (malicious traffic from blacklisted countries getting blocked) and a low false negative rate (malicious traffic from whitelisted countries getting through). As little malicious traffic getting through as possible.
The downside to geoip based blocking is that legitimate users from the desired country of interest will also be blocked if they use a vpn. Apart from this, the flow described below works well enough if you have a few countries to whitelist. It won't work as well if you have a large whitelist; it might be faster to switch the nft rule to a blacklist then.
There is also probably going to be a bit of a lag between when the ip allocation list is updated and when you update your firewall, which means you could be blocking legitimate users for that short duration.
Finally, a forward looking note. It seems it is becoming more and more difficult to map ips to the right geolocation. The future seems to consist of decoupling ips from location, especially with telecom providers doubling up as isps and the advent of satellite based isps. We will only see more ip address portability across isps and countries, not less.
Code for all this resides at https://github.com/sap43/nftables-geoip, along with an explanation of the firewall rulebase. The github firewall is a little more elaborate than the plain jane geoip service described below. We will build up to that level of maturity with time, more blog posts and more cups of coffee.
By the by, this doesn't even need to end up in a firewall. Once you have all this data available for querying, you could also throw up an ip lookup or country-to-ip lookup service as a website and monetize it.
Alright. Let's build this. We start right from the top.
We go to the IANA website first, since we also want stats on how many ips have been allocated to which region. Head to https://www.iana.org/assignments/ipv4-address-space/ipv4-address-space.xhtml, which lists the IANA-to-RIR allocations. We download a csv version of the file and analyze it. There are 256 possible /8 prefixes, from 000/8 to 255/8, each of which is listed in this file. Out of this address space, 204 prefixes have been assigned to one of the 5 RIRs; the breakup is listed below, followed by a small parsing sketch. It is a little disconcerting to see that every allocation is a whole /8. Why? We are in luck: the IANA specifies that, as a matter of policy, it only assigns /8 ranges directly (https://www.icann.org/resources/pages/allocation-ipv4-rirs-2012-02-25-en), so we can trust this data and don't need to worry about /9, /10, /11 and so on allocations, or about classes for that matter, all of which would have complicated our analysis somewhat.
ARIN has almost half of the total RIR allocation of 204 prefixes, with AFRINIC the lowest at just 6. Each /8 is 256^3, or around 16.7 million, ip addresses.
AFRINIC - 6
APNIC - 51
ARIN - 95
LACNIC - 10
RIPE - 42
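If you want to reproduce this breakup yourself, something along the lines of the sketch below works. It assumes the csv exposes "Prefix" and "Designation" columns and that "Administered by X" rows count towards X; check the header of the file you actually download, since the exact column names are an assumption here.

```python
# Sketch: tally IANA /8 allocations per RIR from the ipv4-address-space csv.
# Column names ("Prefix", "Designation") are assumed; verify against the real file.
import pandas as pd

CSV_URL = "https://www.iana.org/assignments/ipv4-address-space/ipv4-address-space.csv"
df = pd.read_csv(CSV_URL)

RIRS = ["AFRINIC", "APNIC", "ARIN", "LACNIC", "RIPE"]

def which_rir(designation):
    """Map a Designation string (e.g. 'Administered by ARIN') to an RIR name, else None."""
    for rir in RIRS:
        if rir in str(designation).upper():
            return rir
    return None  # legacy, reserved or special purpose /8s

df["rir"] = df["Designation"].map(which_rir)
print(df["rir"].value_counts())                       # per-RIR /8 counts
print(df["rir"].notna().sum(), "of", len(df), "/8s map to an RIR")
```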
The interesting takeaway here is direct allocation by the IANA to organizations managing networks large enough that a 16.7 million ip block is warranted. Yeesh. Okay. So if the IANA is doing this, it makes sense to assume that RIRs will also allocate blocks directly, either to a country level org that sub-allocates further (an LIR, NIR etc.) or to an autonomous system.
We head on over to the ftp site for our RIR, APNIC. These allocations are updated at a daily cadence, right around the same time. Head to https://ftp.apnic.net/pub/stats/apnic/ and look at the dates and times for each version of the file published. Around 1:17 AM daily. Neat. Definitely a cronjob…
So now we build our own script pulling the latest allocations, always published under the file name "delegated-{RIR name}-extended-latest".
For example, delegated-apnic-extended-latest.
We want to download and parse the extended allocation files; these contain about double the information of the non-extended files. The apnic extended file is around 8.3 mb, while the non-extended file is 3.8 mb.
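A minimal fetch, assuming the url pattern above and a local filename of your choosing:

```python
# Sketch: fetch the latest extended delegation file for one RIR and save it to disk.
# The url follows the "delegated-{rir}-extended-latest" naming described above;
# other RIRs publish the same file on their own ftp hosts.
import urllib.request

RIR = "apnic"
URL = f"https://ftp.apnic.net/pub/stats/{RIR}/delegated-{RIR}-extended-latest"
OUT = f"delegated-{RIR}-extended-latest.txt"

urllib.request.urlretrieve(URL, OUT)
print(f"saved {URL} -> {OUT}")
```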
If you want a multi region whitelist, you will have to head to the site for each RIR and pull the corresponding version of this file from each one. You could of course also head to https://ftp.ripe.net/pub/stats/ripencc/nro-stats/latest/nro-delegated-stats and just download the merged file, no need to hit 5 different urls; whatever works best for your pipeline.
What works in our favor is that all the orgs maintaining these ip allocations publish their ranges in the same format, so once you've built your parser for one region, you can easily extend the code to handle all regions across the globe.
I'm going to explain the file format very quickly here; the rest of the process is straightforward and fairly logical. The first few lines contain a summary of the entire document: a disclaimer, and a check of sorts (how many total rows of ipv4 and ipv6 data we have). We also have similar rows for ASNs, listing which autonomous system number was allocated to which country and when. Ummm. Okay? For ASes, sadly, we don't have ip allocation data present in these files. But why? If the RIR is allocating numbers directly to ASes, and the IANA is allocating ranges directly to ASes, then who is allocating ips to the ASes assigned numbers by the RIR? Presumably, these would be international, intra regional ASes operating networks large enough to warrant an ASN directly from the RIR. I'm guessing their ip ranges are present somewhere, just not necessarily in the same file. Let's first work with whatever data we have on hand.
Below is what a sample ip allocation row and a sample asn allocation row look like.
apnic|AU|ipv4|119.13.240.0|2048|20090518|allocated
apnic|JP|asn|1233|1|20020801|allocated
For the ip allocation, we start with the name of the regional internet registry that maintains the file, then the country code, the type of range (ipv4 or ipv6), the starting address of the block, the number of ip addresses allocated, the date of allocation, and the current status. That is the foundation stone upon which we will build our geoip based blocking service. We simply filter for all rows with the target country of interest, then for the ipv4 and ipv6 ranges, and convert the desired ranges into cidr notation that nftables will accept. This gives us the country specific whitelist.
The asn row is similar. It lists the rir handling the allocation, the country code, the value 'asn' (to differentiate the row from ipv4 and ipv6 allocations), the asn, how many such numbers are assigned, the date of assignment and the current status.
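A quick sketch of turning the sample ipv4 row into nftables friendly CIDR blocks. The only wrinkle is that the address count need not correspond to a single aligned block, so we let the ipaddress module split it as required.

```python
# Sketch: turn one "ipv4" row from the delegated file into CIDR blocks.
# The count field (2048 in the sample row) is a number of addresses, not a prefix
# length, so summarize_address_range() does the splitting for us.
# (For ipv6 rows the value field IS already a prefix length, so "start/value" is the CIDR.)
import ipaddress

line = "apnic|AU|ipv4|119.13.240.0|2048|20090518|allocated"
registry, cc, rtype, start, value, date, status = line.split("|")[:7]

first = ipaddress.ip_address(start)
last = ipaddress.ip_address(int(first) + int(value) - 1)   # last address in the block
cidrs = list(ipaddress.summarize_address_range(first, last))
print(cc, [str(c) for c in cidrs])                         # AU ['119.13.240.0/21']
```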
Country codes for all countries across RIRs are at
https://www.apnic.net/get-ip/get-ip-addresses-asn/check-your-eligibility/iso-3166-codes/
We now write a couple of python scripts that hit the url for our file, save it to disc, load it into memory, process the data into a pandas dataframe, filter by ipv4/ipv6 and by the desired country codes, format the resulting ip ranges into CIDR notation, and save the final ranges to a fresh, clean txt file in the right format for nftables. Our nftables rules already import this file, so we just restart nftables once the process is over, which pulls in the updated whitelist. We use a Debian distro, so a cronjob runs these scripts every day, keeping our ip whitelists nice and fresh.
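Condensed, the daily job looks something like the sketch below. The output path, the WHITELIST_V4 variable name and the reload command are placeholders rather than the exact rulebase from the github repo; wire them up to match your own nftables setup.

```python
# Sketch of the daily refresh: parse the delegated file, filter by country,
# emit an nftables "define" file, and reload the ruleset.
# Paths, the variable name and the reload command are placeholders.
import ipaddress
import subprocess

DELEGATED = "delegated-apnic-extended-latest.txt"   # downloaded earlier
NFT_OUT = "/etc/nftables.d/whitelist_v4.nft"        # hypothetical include path
COUNTRIES = {"IN"}                                   # whitelisted country codes

cidrs = []
with open(DELEGATED) as fh:
    for line in fh:
        if line.startswith("#") or "|" not in line:
            continue                                 # comments
        fields = line.strip().split("|")
        if len(fields) < 7:
            continue                                 # version / summary rows
        registry, cc, rtype, start, value, date, status = fields[:7]
        if cc not in COUNTRIES or rtype != "ipv4":
            continue
        if status not in ("allocated", "assigned"):
            continue
        first = ipaddress.ip_address(start)
        last = ipaddress.ip_address(int(first) + int(value) - 1)
        cidrs += ipaddress.summarize_address_range(first, last)

# The main ruleset can then use the variable in a rule such as:
#   ip saddr $WHITELIST_V4 accept
with open(NFT_OUT, "w") as out:
    out.write("define WHITELIST_V4 = {\n")
    out.write(",\n".join(f"    {c}" for c in cidrs))
    out.write("\n}\n")

# Reload so the main ruleset re-reads the include file (command is environment specific).
subprocess.run(["systemctl", "restart", "nftables"], check=True)
```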
Nftables will now filter out and drop all traffic not originating from India, in realtime, over and above the existing protocol and port specific rules you might already have in your firewall.
Diving down the asn – ip address allocation rabbit hole
Even though the file we looked at assigns ip addresses to NIRs and LIRs, eventually these ips get allocated to organizations. All ips assigned by an NIR/LIR must come out of the range the registry received from its RIR, so we don't need to worry about allocations by the NIR/LIR to orgs; these are subsumed within the ranges we find in the RIR allocation files.
So now we just need the ip address ranges allocated to autonomous systems by RIRs. There are two components to this. The core is to whitelist traffic from all ASes registered in the countries we want to whitelist. This broadly makes sense: malicious or otherwise, all traffic from whitelisted countries is allowed through and gets blocked at the application layer or, if repeatedly misused, temporarily blocked at the network layer. We can't do permanent blocks at the network layer since almost everyone hitting our site will be behind CGNAT, which means the public ip in our logs is close to useless for targeting an individual system; that single ip is probably shared by a couple of million different users.
The second, optional component is to add broad, permissive rules that whitelist traffic from ASes that are end users assigned ips directly by RIRs, globally, even in countries that aren't whitelisted. This barely grows our attack surface, since an end user being assigned ips directly by an RIR is going to be a large, significant corporate that is unlikely to be the source of malicious traffic. Again, think a Microsoft or an Apple.
This logic should apply to all ASes that are corporate networks, but whitelisting based on the details/nature of an AS gets a little tricky. There is no field in any of these files that succinctly describes the nature of business of the AS or, even more simply, whether the assigned ips are for internal corporate use or for reselling. There are files with descriptions of the company operating the AS, quite often the name of the company itself is somewhat helpful, and there are excellent resources like https://www.potaroo.net/bgp/iana/asn-ctl.txt that list AS names globally. Still difficult to pull off, but for our purposes, building the core component is good enough to move on. The second component also takes us far beyond the traditional meaning of an ip geolocation based filtering service, but I guess that's what we get for wanting to build our own service: we get to test definitions and assumptions and push the envelope a little bit.
Okay. But where is all this data? We still need the ip range allocations for ASes, keeping in mind these could be over and above the country allocations or subsumed by them. Using IANA as an example, the AS ip range allocations are independent of the country allocations: Microsoft may have its global HQ in the US, but its ip allocation by the IANA is over and above the allocation to the RIR for the Americas. Is the same principle at play with RIR-to-AS ip allocations?
It took me some time to get this, but ip assignment says nothing about the path to the network, or about the mac addresses of the routers that will host these ips. You may have assigned ips to an org, but how do you determine which router on the global network has the ip you assigned? Bear in mind you are assigning ips to legal entities, not directly to network devices. Shockingly, it turns out that external bgp routers simply announce which ips they manage. The nerve. Lol. And theoretically, though it has become much more difficult now, I could set up my own bgp router announcing fake routes and ips controlled by my router, and thus disrupt the flow of traffic on the global network.
Keeping aside the security aspect, let's assume we have a high trust network and can trust the announcements of all routers on it. The best path to any ip is discovered by probing the network incrementally for the most specific version of the route found in neighboring bgp routers; there is no global systems view held at any central authority that can be queried once to determine this. Match the most specific prefix possible, fall back to broader prefixes until something is found, else use the default gateway, and spread outwards. That is a lot of lookups across a large number of devices, radiating outwards with the source at the center. Fairly computationally expensive. Thank god for Moore's law and 30 years of it holding up.
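As a toy illustration of the longest prefix match idea (most specific route wins, default route as the fallback), with a made-up three entry routing table; real routers do this in hardware against constantly changing bgp-learned tables, so this is illustration only:

```python
# Toy longest-prefix-match: pick the most specific route covering an address,
# falling back to the default route (0.0.0.0/0) when nothing else matches.
import ipaddress

routes = {                       # hypothetical routing table: prefix -> next hop
    "0.0.0.0/0": "gateway-upstream",
    "119.13.0.0/16": "peer-a",
    "119.13.240.0/21": "peer-b",
}

def next_hop(addr):
    ip = ipaddress.ip_address(addr)
    matches = [
        (ipaddress.ip_network(prefix), hop)
        for prefix, hop in routes.items()
        if ip in ipaddress.ip_network(prefix)
    ]
    # longest prefix = largest prefixlen wins
    return max(matches, key=lambda m: m[0].prefixlen)[1]

print(next_hop("119.13.241.7"))   # peer-b (the /21 beats the /16)
print(next_hop("8.8.8.8"))        # gateway-upstream (default route)
```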
Before I peered under the hood, I always thought routers would discover paths incrementally, octet by octet, and that all networks falling within a similar octet range would be co-located in network topology terms. Ere 'twas not to be, I guess. Which is all well and good, since we would probably end up querying decentralized routers for the best path anyway, so that changing network topology, path latency, bandwidth and so on can be taken into account, which means rediscovering the best available paths every few minutes makes sense. Building a single global view of the internet would be very expensive, holding that much data in memory and on disc, and would make all lookups more expensive.
This is not the same as internal routing within an AS, though, where the network is small enough that it is possible to hold a global view of it and use simpler algorithms to discover shortest paths.
We do much fafo'ing. Read about bgp, paths, routes, network topologies and so on. A chance email to Chuck at APNIC turns up luck: the data we need is on the ftp site for APNIC, https://ftp.apnic.net/apnic/whois/. I was getting a bit fearful this database hadn't been made publicly available, and that we would have to do things like query a whois gui (https://whois.ipip.net) every day, or hit a bgp lookup tool or a looking glass server (https://bgp.he.net or https://bgp.tools/as/) 12,000 times to identify the ip ranges allotted to each asn handed out by our RIR, APNIC, or parse routing info out of MRT files (https://github.com/rfc1036/zebra-dump-parser?tab=readme-ov-file). Urgh.
That link hosts a whole range of files. Of interest to us are apnic.db.inetnum, apnic.db.as-block, apnic.db.aut-num and apnic.db.route. Ideally, we are looking for a file with country codes, AS numbers and ip ranges.
The inetnum file lists ranges and country codes but no ASNs. The as-block file lists ASN blocks and country codes but no ips; likewise for the aut-num file. The route file lists all 3 parameters of interest. Interestingly, though, each file has missing data points, as we will soon discover, warranting the use of all of them to fill gaps in our db.
Let's begin with the route file, a lovely 200 mb with a self explanatory structure. Spend some time just glancing through it to get a feel for it so we can build our parser. Then load it up using python, skip the first 16 header rows, parse the rest of the file line by line looking for lines beginning with route, origin and country, and place all this into a pandas dataframe for analysis later on. It doesn't take long, a mere 4 seconds on my somewhat old and trusty machine. I end up with 7.5 lakh (750,000) odd rows of data.
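A sketch of that parser, assuming the file is made up of blank line separated objects carrying route:, origin: and country: attribute lines as described above. Lines that are not one of those attributes simply never match, so the header rows need no special handling.

```python
# Sketch: pull route/origin/country attributes out of apnic.db.route into a DataFrame.
# Attribute names and the blank-line-separated object layout are assumptions based on
# the description above; verify against the actual file.
import pandas as pd

records, current = [], {}
with open("apnic.db.route", encoding="utf-8", errors="replace") as fh:
    for line in fh:
        if not line.strip():                       # blank line closes an object
            if "route" in current:
                records.append(current)
            current = {}
            continue
        for key in ("route", "origin", "country"):
            if line.lower().startswith(key + ":"):
                current[key] = line.split(":", 1)[1].strip()

routes_df = pd.DataFrame(records, columns=["route", "origin", "country"])
print(len(routes_df), "route objects,",
      routes_df["country"].isna().sum(), "without a country code")
```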
We check how many unique ASNs exist without a country code: 9k odd. We started out with 12k odd numbers assigned to ASes, and 9k of them have ip range data but no country code. Lovely. More interestingly, 6.5 lakh ranges don't have an assigned country code. Yeesh. Apparently most of this file is missing this datapoint. This is where the remaining files come in, you would think, but not yet: we are not done squeezing lemonade out of this lemon.
Since the route file lists ASNs and country codes per range, is it possible that the same AS has been assigned another range somewhere in the file that does have a country code marked? Let's check. This will be computationally expensive code, since we need to filter the pandas dataframe by ASN, check whether any valid country codes are present, and if yes, assign that code to all of that ASN's ranges that are missing it. The working assumption being that each ASN can and must have its registered head office in just one country.
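The same borrow-a-sibling's-country-code logic can also be expressed as a single pandas groupby rather than a per-ASN filter loop, which should run quite a bit faster; a sketch, reusing the routes_df from the parser above:

```python
# Sketch: for every ASN, copy a known country code onto that ASN's ranges that lack one.
# Assumes routes_df has "origin" (ASN) and "country" columns as in the earlier sketch.
def first_valid(series):
    """Return the first non-null country code seen for this ASN, if any."""
    non_null = series.dropna()
    return non_null.iloc[0] if len(non_null) else None

asn_country = routes_df.groupby("origin")["country"].agg(first_valid)
routes_df["country"] = routes_df["country"].fillna(routes_df["origin"].map(asn_country))

print(routes_df["country"].isna().sum(), "ranges still missing a country code")
```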
We get a break! 13 minutes later, we are down to just 750 odd ASNs and more importantly, just 6.3k odd ip ranges missing country codes.
Let's keep pushing. We pull up the aut-num file, load it, parse it, place its relevant data into a pandas dataframe, interpolate the missing data, and check again. Down a bit more, to just north of 700 ASNs. Hmmmm. Technically speaking, we are at a coverage gap of less than 6% of ASNs (700/12k) and less than 1% of ip ranges (6,300/7.5 lakh). Of course, the even smarter way to calculate this would be by ip addresses, not ip ranges or ASNs: missing the code for a /8 range would be devastating in practice but wouldn't show up in these numbers, being just one range out of 6,300.
Anyhoo. We do the same with the last file in our toolkit, as-block. This one needs a little more processing: we first need to find the block range that our ASN falls within, and only then can we determine the country code. We do this, run through the same shigamadig, and sadly this file is also missing country codes fairly frequently, but we do the dew and end up with 665 ASNs and 5.3k odd ranges missing country codes.
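A sketch of the as-block lookup, assuming the objects carry "as-block: ASx - ASy" and "country:" attribute lines; treat those attribute names as an assumption and verify them against the actual file.

```python
# Sketch: find the country code of the as-block containing a given ASN.
# Attribute names and layout are assumptions; check the real apnic.db.as-block file.
import re

blocks = []                                  # (start_asn, end_asn, country)
start = end = country = None
with open("apnic.db.as-block", encoding="utf-8", errors="replace") as fh:
    for line in fh:
        m = re.match(r"as-block:\s*AS(\d+)\s*-\s*AS(\d+)", line, re.I)
        if m:
            start, end, country = int(m.group(1)), int(m.group(2)), None
        elif line.lower().startswith("country:") and start is not None:
            country = line.split(":", 1)[1].strip()
        elif not line.strip() and start is not None:   # blank line closes the object
            blocks.append((start, end, country))
            start = end = country = None

def block_country(asn):
    """Return the country code of the block containing this ASN, if any."""
    for lo, hi, cc in blocks:
        if lo <= asn <= hi:
            return cc
    return None

print(block_country(1233))    # country code of the block containing AS1233, or None
```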
Hmmm. This still feels sort of unpalatable. Let's build the entire pipeline first, though, and come back to refine this further. Now that we finally have ip ranges for most of the ASNs at the RIR level, let's conduct a check: how many of these ranges were allocated out of a country allocation…
For this, we go back to the RIR extended delegated file and cross check the ip ranges from the enhanced routes dataframe. We run through each ip range in the routes data and check whether it is present in, or subsumed by, the ranges in the RIR file, using the ipaddress library. Or, a smarter way: simply run this check only on the ASNs for which we have ranges but no country data. Hmmm. Let's see what happens.
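A sketch of the subset check, comparing integer address boundaries rather than prefix objects; the delegated file path and the /21 in the final print are examples only.

```python
# Sketch: is a route prefix contained inside one of the country allocations from the
# extended delegated file? Country allocations are stored as (first, last) integer pairs
# built from the file's start + count fields.
import ipaddress

def build_country_ranges(delegated_path):
    ranges = []
    with open(delegated_path) as fh:
        for line in fh:
            fields = line.strip().split("|")
            if len(fields) < 7 or fields[2] != "ipv4":
                continue                          # skip headers, summaries, non-ipv4 rows
            first = int(ipaddress.ip_address(fields[3]))
            ranges.append((first, first + int(fields[4]) - 1))
    return ranges

def subsumed(route_cidr, country_ranges):
    net = ipaddress.ip_network(route_cidr, strict=False)
    lo, hi = int(net.network_address), int(net.broadcast_address)
    return any(lo >= first and hi <= last for first, last in country_ranges)

country_ranges = build_country_ranges("delegated-apnic-extended-latest.txt")
print(subsumed("119.13.240.0/21", country_ranges))   # True if inside a country allocation
```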
27 minutes later (thank god we started with the leftover ASN list), we are done. We had set up two counters: one incremented when a leftover ASN ip range is found within a country allocation, and the other when this subset check fails.
Yeesh. 5,300 odd ip ranges were analyzed, and guess what? Each and every one was found to be a subset of a country allocation. The nerve. Is there a systemic explanation for this? Why would these ranges be missing their country code when they clearly come from a country allocation?
Okay. So apparently what the RIR is doing is making a country allocation first, and then allotting a subset of it to ASes requesting resources that are registered within said country. Oh lordy. Computationally, I guess this is good news: we need to do much, much less processing every day to update our whitelists. This was just beginning to get fun though, because guess who has two thumbs and doesn't want to run over an hour of code every day just to update a firewall whitelist? Optimizing this to run in a couple of minutes would have been super fun!
Alright. That's it for the moment. As it turns out, none of this ASN rabbithole was necessary: just pulling data from the extended delegated files and whitelisting it will get you started with your very own geoip filtering service.