Geolocation, building DAT and MMDB files
Spoiler: Most software programs that read geolocation data are only compatible with the DAT or MMDB format, which means you have to buy them from the format’s publisher (maxmind). To break the monopoly, the arsouyes share their conversion scripts with you, so you can produce your own files.
When using off-the-shelf applications or libraries, you’re dependent on the choices made by their developers. Geolocation is not an exception, and if you want these tools to work, you need to provide them with databases in the right format.
This market is in fact dominated by a single company, Maxmind, and
the two file formats it has developed, the old dat
and the
new mmdb
. If the tool you’re using does geolocation, it
will need files in one of these two formats.
The problem is, that to produce the most accurate files possible, Maxmind and its competitors can’t content with public data. They also have to collect and correlate personal data. Of course, they don’t communicate about how they do that, but when an application asks you for your location, you now have an idea of potential customers.
If, like us, you’ve set up a DNS ad blocker, their domain is blacklisted. To access their websites, you’ll need to add an exception to your firewall.
If you don’t want to be involved in this data grabbing, you should avoid with their services and build your own files. You’ll lose accuracy (by getting the country, not the city) but you’ll gain our respect. All in all, you win.

After showing you that these data are public
and then providing you a PHP script
that retrieves them and produce a sqlite
database, today
we’re going to look at how to produce files in these two classic
formats.
Note to devops. You can also use these methods to produce files with artificial data (e.g. private addresses or non-existent countries). Useful for testing or pre-production.
DAT format
Or rather GeoIP legacy, the first file format developed by MaxMind for storing geolocation data. Due to its great age, this format is available natively in many languages (C, Java, Perl, PHP, … cf. official list). The disadvantage is all the data (e.g. country, city, ISP) are in separate files.
Technically, this format models the data set as a huge binary tree, with each node corresponding to a network and the two children to the two immediate sub-networks. The leaves store geolocation information (country, city, ISP, etc.) depending on the case.

For this format, Maxmind provides ready-to-use tools, some of them
are command-line based, for building your own database and performing
searches. These are available in any Linux distribution via the
geoip-bin
package. For Ubuntu and Debian, the following
command will install what you need.
sudo apt-get install geoip-bin
Build a CSV
The data provided by the Regional Internet Registry are already in CSV format, but the fields are not compatible with Maxmind’s tools. So you need to build a new file with the right fields, in the right order…
This information is not documented, but you can reconstruct it. Either by looking at the CSVs published by Maxmind, or by looking at the source code of the corresponding binaries (there’s an
enum
whose elements have evocative names).
If you want to make your own file, here’s the order of the fields in
the CSV
files, only three of which will be useful during
conversion:
- First address of the network,
- Last address of the network,
- First address (in a 32 bits integer) of the network, unused,
- Last address (in a 32 bits integer) of the network, unused,
- Country Code,
- Name of the country, unused,
Notes to devops : If you want to use the following scripts in order to construct your test databases, you can use
CSV
files with your test sets.
If you use our PHP code, you can add the following function at your class, in order to export IP V4 networks :
public function toCSV4() {
$st = $this->pdo->prepare("select * from IPv4") ;
$st->execute() ;
foreach ($st as $row) {
echo long2ip($row["start"]) . "," ;
echo long2ip($row["end"]) . "," ;
echo "," ;
echo "," ;
echo $row["country"] . "," ;
echo "\n" ;
} }
IP version 6 is similar, but as we’ll see, the
dat
format doesn’t handle it well. I’ll show you how to make acsv
for IP version 6 later in the article.
To create the CSV
file, you can now call the function in
a stand-alone script, which opens the database passed as the first
argument and writes the contents in CSV
format to the
standard output:
#!/usr/bin/env php
<?php
include "Database.php" ;
$db = new Database($argv[1]) ;
$db->toCSV4() ;
Build a DAT
Now that we have a CSV
file in the right order, we can
pass it on to Maxmind’s converter. This is one of the binaries in the
package we installed earlier. By default, it’s not in the
PATH
, you’ll have to call it with its absolute address (or
modify the PATH
, but that’s not as good):
/usr/lib/geoip/geoip-generator -4 -v -o geoip_database.dat geoip_database.csv
For version 6 IP addresses, you’d have to change the -4
option to -6
. The -v
option gives you a bit of
information on the progress, it’s optional but I find it useful.
Tourism: This format is designed to be efficient in search and storage, it takes up much less space than the
CSV
version; 2MB fordat
versus 11MB forcsv
.
Check and use
To check the contents of the database, or use it on the command line,
you can use another of the available binaries, geoiplookup
,
which searches the database of your choice. This binary, however, is in
the PATH
and you can invoke it by name.
$ geoiplookup -f geoip_database.dat 188.165.53.185
GeoIP Country Edition: FR, France
You’ll notice that it found by itself that the code FR
corresponds to France, without my needing to use a mapping table.
IP version 6
It seems that this format is not adapted to IPv6. You can produce a
CSV
file with IPv6 data (including ipv4 addresses inside
dedicated networks) then convert it with geoip-generator
(and -6
option). The problem is during the seach…
On one side, the tool geoiplookup
is buggy and
nobody wants correct it (the tool considers ipv6 addresses as hostnames
and don’t succeed to resolve them).
On antoher side, the official
PHP
library fail to search inside those files. IPV4 don’t
find anything, IPV6 generate fatal error.
GeoIP API: Error traversing database - perhaps it is corrupt?
If you really need ipv6 support, you must use mmdb
format.
MMDB format
Or rather GeoIP2, the second format developed by MaxMind to fill the gaps left by its predecessor. This format can store any structured data corresponding to a network. Maxmind distributes its files with geolocation data, but the format can in fact use your own structures.
This time, Maxmind hasn’t bothered to develop a whole host of libraries and tools. The specifications are available and only the most common languages have a reader. It’s up to developers to add their own libraries for their favorite languages.
Now that the trend has changed, Maxmind also offers web services. You no longer need to update your files, but would have to pay a subscription.

CSV for IP version 6
If you only use version 4 IP addresses, the previous csv
will do the job. Otherwise, here’s how to export the contents of both
version 4 and version 6 IP databases into a single csv
which you can then transform into a mmdb
…
Once again, we’ll complete the class with a few functions.
The first function converts the integers stored in the table into readable version 6 IP address.
public static function bigInt2Ip6($ip) {
$bin = pack("J2", $ip, 0) ;
return inet_ntop($bin) ;
}
The second will write the csv
lines for version 6 IP
addresses. I only fill in the useful fields, so the others are
deliberately left blank.
public function toCSV6() {
$st = $this->pdo->prepare("select * from IPv6") ;
$st->execute() ;
foreach ($st as $row) {
echo self::bigInt2Ip6($row["start"]) . "," ;
echo self::bigInt2Ip6($row["end"]) . "," ;
echo "," ;
echo "," ;
echo $row["country"] . "," ;
echo "\n" ;
} }
We could stop there, but in order to produce a database containing the geolocation of all IP addresses in a single file, we’ll have to translate the old IP version 4 addresses into their IP version 6 format. And there are sub-networks for just that.
::0:0/96
is the IPv4-compatible subnetwork. In other words, addresses that have the same integer value in both networks. This subnet is obsolete and should be replaced by the next one (but as some libraries still use it, we sometimes have to manage it too).::ffff:0:0/96
is the official subnet for representing IP version 4 addresses. These addresses can be manipulated in software, but are not routable.
public function toCSV4as6() {
$st = $this->pdo->prepare("select * from IPv4") ;
$st->execute() ;
foreach ($st as $row) {
echo "::" . long2ip($row["start"]) . "," ;
echo "::" . long2ip($row["end"]) . "," ;
echo "," ;
echo "," ;
echo $row["country"] . "," ;
echo "\n" ;
echo "::ffff:" . long2ip($row["start"]) . "," ;
echo "::ffff:" . long2ip($row["end"]) . "," ;
echo "," ;
echo "," ;
echo $row["country"] . "," ;
echo "\n" ;
} }
Tourism : Normally, IPv6 addresses are written in hexadecimal. But for readability, it’s allowedto write the last 32 bits as you do with IPv4 addresses (numbers between 0 and 255 separated by dots). I’m using this convention here to simplify writing.
With these three new methods, we can write a script to convert the
sqlite
database to csv
:
#!/usr/bin/env php
<?php
include "Database.php" ;
$db = new Database($argv[1]) ;
$db->toCSV4as6() ;
$db->toCSV6() ;
Build a mmdb
The only available library for writing your own files is in Perl, so
we’ll now turn to this language. In order to use third-party libraries,
we initialize cpan
.
cpan
As I don’t have any special features, I leave all choices by default. Once the process is complete, you need to restart your terminal to ensure that everything has been taken into account (this is explicitly stated at the end).
To create a mmdb
file, you need the library developed by
Maxmind, Maxmind::DB::Writer
. It is installed via
cpan
using the following line:
cpan install MaxMind::DB::Writer
Instead of giving you the whole script at once, I’ll present it to you little by little. Let’s start with the classic inclusions in any :
#!/usr/bin/env perl
use strict;
use warnings;
use MaxMind::DB::Writer::Tree;
It is then necessary to declare the data types that will be inserted into the search tree. You can think of this as a schema in a NoSQL database. I’ll come back to these fields and their meaning later.
my %types = (
'map',
country => 'utf8_string',
iso_code => 'map',
names => 'utf8_string',
fr => 'boolean',
is_in_european_union => );
We can create an object to store the search tree and the data. Here, we’re just talking about meta information, which in a way specializes the database.
database_type : is a string that tells you what’s in the database. It must contain
Country
(case-sensitive) so that it can be used in third-party libraries (we’ll come back to this later),languages : lists the languages supported for country names (because you can insert names translated into several languages), which is only useful for readers checking that the file is compatible.
description : allows you to say more about the base, note that it’s internationalized,
ip_version : 4 or 6 depending on the format of your addresses,
record_size : record size in bits, 24 or 32, the doc isn’t very clear: “ you can put 128 but drives only accept 24 or 32 ”
mak_key_type_callback : the function that determines how to store the data.
my $tree = MaxMind::DB::Writer::Tree->new(
'Country',
database_type => 'fr'],
languages => [
description =>'Geolocalisation IPv4', },
{ fr => 6,
ip_version => 24,
record_size => # Typage
sub { $types{ $_[0] } },
map_key_type_callback => );
Once the object has been created, we can build the CSV
,
how to crawl it and the insertion of data into the tree. This is where
field listing and typing is used. Technically, you can store any
structure you like in the tree. The mmdb
format is designed
for this, and your readers will simply have to be compatible with your
database.
As the aim is to create a database compatible with our tools, our
data must be formatted in the same way as official databases. No
documentation exists, so I had to retro-design the libraries
(in PHP
if you must know):
country
contains geolocation data at country level.iso_code
is the two-letter country code, ISO 3166-1 alpha-2,is_in_european_union
is only present (and true) if the country is in the European Union. Here, I put yes all the time to show you the syntax, but for a real database, a littleif
would be necessary,names
contains a table where the country name is translated into the languages, here I’ve only put French (fr
) and I’ve put the country code, to show the syntax,geoname_id
can be used to reference the country with its code in the geonames database, but to keep it simple, I haven’t managed it at all.
my $csv = $ARGV[0] or die "Need 2 arguments" ;
open(my $data, '<', $csv) or die "Could not open file $csv\n" ;
while (my $line = <$data>) {
chomp $line ;
my @fields = split "," , $line;
my $record = {
country => {$fields[4],
iso_code => 1,
is_in_european_union =>
names => {$fields[4],
fr =>
},
},
} ;$tree->insert_range($fields[0], $fields[1], $record ) ;
}
Once all the records have been inserted, all that remains is to save the file, the easiest step.
my $filename = $ARGV[1] or die "Need 2 arguments" ;
# Write the database to disk.
open my $fh, '>:raw', $filename;
$tree->write_tree( $fh );
close $fh;
With this script, you can retrieve the previously exported
csv
database and convert it to mmdb
:
csv2mmdb.pl geoip_database.csv geoip_database.mmdb
Check with Perl
As we don’t have a ready-made tool, we’ll continue with perl scripts.
This time, the required library is different,
MaxMind::DB::Reader
.
Note that if you’ve already installed the
Writer
, theReader
is in fact already available.
cpan install MaxMind::DB::Reader
This time, the code is much simpler: just import the modules, open the file, search and display the data.
#!/usr/bin/env perl
use strict;
use warnings;
use MaxMind::DB::Reader;
my $filename = shift @ARGV or die "Usage: $0 <file> [ip_address]";
my $reader = MaxMind::DB::Reader->new( file => $filename );
foreach my $ip (@ARGV) {
my $record = $reader->record_for_address( $ip );
print $ip . " => " . $record->{country}->{iso_code} . "\n" ;
}
You can then perform a command-line search:
./search.pl geoip.mmdb 188.165.53.185 ::ffff:188.165.53.185 2001:41d0:301::21
188.165.53.185 => FR
:188.165.53.185 => FR
::ffff2001:41d0:301::21 => FR
Check with PHP
Since it’s important to create compatible databases, here’s another
search script, in PHP
, using GeoIP2, the official
Maxmind api.
To keep things simple, I use this API via composer and the following command line:
composer require geoip2/geoip2:~2.0
I can then use the autoloader and the classes provided by Maxmind. The following example is inspired by the official :
#!/usr/bin/env php
<?php
require_once 'vendor/autoload.php';
use GeoIp2\Database\Reader;
$reader = new Reader($argv[1], ['fr']);
for ($i = 2; $i < $argc; $i++) {
$ip = $argv[$i] ;
$record = $reader->country($ip);
echo "$ip => " . $record->country->isoCode . "\n" ;
}
This PHP
example enabled me to discover some of the
subtleties of the mmdb
format and its use for
geolocation…
The country method used in PHP to search for a
record checks that the database_type
specified in perl when
creating the tree contains the string Country
somewhere.
Otherwise, the PHP
API throws an exception because,
according to it, there is no country data in your file…
throw new \BadMethodCallException(
"The $method method cannot be used to open a {$this->dbType} database"
; )
Field name changes depending on the API. In Perl (and inside the file), their name is in snake_case with underline between words. In PHP, their name is in camelCase, with capital letters at each word… The explanation can be found in the official code :
public function __get($attr)
{// XXX - kind of ugly but greatly reduces boilerplate code
$key = $this->attributeToKey($attr);
// [...]
}
private function attributeToKey($attr)
{return strtolower(preg_replace('/([A-Z])/', '_\1', $attr));
}
With these problems solved, the script can be used to obtain the country code for IP addresses:
$ ./search.php geoip.mmdb 188.165.53.185 ::ffff:188.165.53.185 2001:41d0:301::21
188.165.53.185 => FR
::ffff:188.165.53.185 => FR
2001:41d0:301::21 => FR
And now ?
With these scripts, you can now produce your own databases compatible with all geolocation tools. Either with artificial data for test phases or pre-production to check that integration is OK. Or with public data to geolocate your users while respecting their privacy.
To continue on this theme, this article may be of interest to you.
- Anonymize IP addresses for statistics
-
March 23rd 2020 I don’t know about you, but I love doing statistics. Since IP addresses are personal data, in order to do this well, they must first be anonymized. Two sed rules will be enough to keep only the first two numbers of IPv4 addresses and the first 3 groups of IPv6s.