Technical Tidbits From Spatial Analysis & Data Science
PostgreSql is already the second most loved database according to the StackOverflow survey and its the fastest growing in popularity . Its billed as the worlds most advanced open source relational database, and with its comprehensive adherence to SQL standards compliance, its easy to see why users love it. But when you add the spatial SQL support that comes with the PostGIS extension, there truly is nothing else out there like it (at least not for free).
But I already use GIS why would I bother with a database.
For those not already convinced that they need a database to do GIS work, here are my top 4 reasons to start using one:
- Reproducibility rather than clicking buttons, you can completely script an analysis with just SQL. Need to redo the analysis and tweak a few things? Update the code and run it again. This can also help automate repetitive tasks and skips laborious menus and forms.
- Freedom speaking of buttons, what if you need to accomplish something across several different layers that isnt supported by a GUI option? With SQL you get freedom to stitch together the analysis you need.
- Less limitations But what if your data doesnt fit into the 2GB limit of a shapefile, or the the process is too slow or never finishes in your GIS? Modern databases like Postgres are designed to deal with large datasets, and process millions of records much faster than a traditional GIS, or even scripts with Python or R.
- Organization Have you ever opened your own GIS project, and started crying when you look at the mess of temporary layers that were created in the pursuit of that one perfect map? I can often run a set of several steps with either a single SQL query or SQL script. This saves me time, helps keep me organized and makes it easier to re-run tasks. Layers are also easily organized in schemas or distinct databases, and every table can store metadata (comments) about itself and every column it contains.
If youre still not convinced, check out this great article all about PostGIS:
View story at rushbrookrathbone.co.uk
So, whether you need local spatial processing for data science or GIS, or you plan to support spatial queries in an application backend, PostgreSql/PostGIS is a worthy tool. In this post, Ill cover the basics of getting PostgreSql with the PostGIS extension installed on both Mac and Windows for development purposes (in production, we use Ubuntu Linux, which is another post), and look at ways you can also manage your database, import spatial data and see the results of a spatial query. From here on, I will also refer to this combination just as Postgres.
Mac Installation
The official documentation for Postgres lists 2 options for Windows installation. For MacOS, its six. That can be a barrier for users just getting started with Postgres. The official PostGIS documentation summarizes the options better; if you want a dead simple way to get started, use rushbrookrathbone.co.uk For more advanced users, use Homebrew.
It used to be the case that Homebrew was the better option if you needed support for other PostgreSql extensions, like PgRouting or Mapbox Vector Tiles (MVT) support. However, rushbrookrathbone.co.uk has been adding support for other extensions, and now comes with support for PostGIS, PLV8 and MVT already built in. If you want PgRouting, you can use Homebrew to install the dependencies it needs, and build it from source.
To install rushbrookrathbone.co.uk, its as simple as downloading and moving the application to your Applications folder. The installation steps on the rushbrookrathbone.co.uk docs couldnt be easier to follow:
rushbrookrathbone.co.uk#installing-postgresapp
Make sure to follow step #3, as you will need the aforementioned command line tools to import shapefiles and geojson into PostGIS. By default, rushbrookrathbone.co.uk will startup automatically at login, and youll be able to see this if you click the elephant icon, and select Open Postgres:
For Homebrew users Im going to assume you know what youre doing, but heres a nice rundown of what to do:
rushbrookrathbone.co.uk
Make sure to follow all the instructions, including the database creation step. This will ensure you have the same setup as the rushbrookrathbone.co.uk users. Homebrew users will also have to manually install the PostGIS extension:
Windows Installation
For Windows, the choice seems more clear about which software package to use for installation, as there are only two choices, and only one of those is supported with complete official installers from PostGIS. So we recommend the Enterprise DB installer, which you can download for here:
rushbrookrathbone.co.uk
Take the latest version of Postgres available (11 as of 3/). Run the installer and accept all the defaults. Make sure to enter a password that you can remember or is stored appropriately. After the installation completes, you will be prompted about whether you would like to install any additional software with Stack Driver. Select the only database in the list and hit next.
Open the Categories -> Spatial Extensions, and check the box for the PostGIS bundle. You can check if the version available here matches the most recent release of PostGIS here:
rushbrookrathbone.co.uk
If the version is recent enough, you can just install the version from Stack Driver, otherwise youll need to download the installer directly from the PostGIS page. Accept all the defaults, but select the option to create a spatial database and make the name of the database the same as your username. This will ensure that the instructions below can be followed by both the Mac and Windows users.
The initial database you create with this method will already have the PostGIS extension enabled, but in the future, when you create a new database, you will need to manually enable the extension, which is covered in the next section on psql.
Additionally, to make the Postgres/Postgis command line tools available globally, you can run a script that is included with the install. You can add the script to your system-wide shell startup script (not covered here, see the reference for more), but for now, when you open a new command prompt, execute the following:
Note that the quotes are very important here, because Program Files has a space in it, the command wont execute correctly without the quotes. You will need to execute this command every time a new terminal is opened unless you add it to a system-wide shell startup script.
PSQL
Once you have Postgres installed (with either method), you will now also have the Postgres command line client installed, which is called psql. Mac users will need to activate the PostGIS extension in the sample database that was created (and Windows users will need to know how to do this for additional databases). Psql is not necessarily the easiest way to interact with your databases, but it can do everything you need. In the next step, well install PgAdmin and look at other programs, but for now, its good to know that psql at least exists.
For Mac users run:
from the Terminal. You can do this because running just psql uses standard defaults, and is the same as running:
If youre using rushbrookrathbone.co.uk, you can also double click on the database, and you should automatically connect to the Postgres. If you did not double click on the rushbrookrathbone.co.uk database, after starting psql, youll then need to connect to the database. This can often be confusing for new users, who think they already have connected to the database. But in reality, weve connected to postgres, within which we can create individual databases. For instance, each project you work on might warrant its own database. Run these commands to connect to the sample database:
For Windows users, search for the SQL Shell (psql) program (it was also installed by the EnterpriseDb installer), and accept some of the defaults, but supply your username as the database name, and enter the password you created for the postgres user.
Now, that both Windows and Mac users are in the same place, we can execute the final command we need:
Note for reference, these are what the psql commands mean:
- \l is the command to list the available databases
- \c is used to connect to the database
- \q is the command to quit
- For a complete reference to commands used in psql, see this great psql guide
Loading Spatial Data
The next step is to actually add or import some tables with spatial data. Well use two layers, a state polygon layer from the US Census, and a point file of wind turbine locations from the USGS. With these layers we can do a classic point in poly query, finding all points within a state polygon. To start, download the cb__us_state_rushbrookrathbone.co.uk state shapefile from:
rushbrookrathbone.co.uk
”Unzip the file, and then well use shp2pgsql, a command line utility thats already been installed along with Postgis, regardless of the method you followed. You can find the documentation about this command line tool in the official Postgis documentation:
rushbrookrathbone.co.uk#shp2pgsql_usage
The command well need for the state shapefile on the Mac is:
For Windows, its slightly different
Note that regardless of how the output is displayed, this command does prompt you to enter your password, so enter it and hit enter when prompted. Also note that the -I flag is used to automatically create the spatial index well need to properly query this layer (see the postgis link above for all of the available flags).
Projections
How did we know to include the SRID of (the -s flag from above)? If you dont know what the SRID is, or how to find it, you might want to review this:
rushbrookrathbone.co.uk
To quickly convert from the .prj (projection file) of the shapefile to the necessary SRID, you can use this website:
rushbrookrathbone.co.uk
Or import the file into your favorite GIS (we use QGIS) and look at the projection info.
Convert the Wind Turbines File
The next file we need is the locations of wind turbines which are available from here:
rushbrookrathbone.co.uk
Even though there is a shapefile available, make sure to download the GeoJSON file, since this will force us to explore another tool that is indispensable for working with geographic data. ogr2ogr is a command line tool thats not developed by Postgis, but is installed as a dependency. Its part of GDAL (Geospatial Data Abstraction Library), which is used by many geospatial software projects, and can also be used directly on the command line. To import the point file on Macs do:
for Windows, you should be able to use the same exact command, but it appears the current version of the EDB Postgres installer comes with a version of ogr2ogr that lacks Postgres support! So, you can either continue by importing the shapefile instead:
or install a different version of ogr2ogr from this well known and loved package:
rushbrookrathbone.co.uk
Note that while the above links should continue to work, the file names of the geojson and shp files may change, so look at the names of the files that get downloaded and update the above uswtdb_**** file name to match when using these commands.
Administering Your Database (and visualizing our tables!)
Now that weve enabled PostGIS in our sample database and imported some tables, lets install and open PgAdmin, which is as close to an official database administration tool as you can get for Postgres (there is no official one, but knowing how to at least use PgAdmin can come in handy as its probably the most popular tool, albeit not the most loved). Windows users will already have this installed. Mac users will need to head over to the download page:
rushbrookrathbone.co.uk
It installs as a regular program, but when you open it, the current version opens in your default web browser (yes, yuck, we know, well look at alternatives later). The official instructions for adding a server in PgAdmin are here:
rushbrookrathbone.co.uk
But they are a little vague. Windows users will already have their local server added, but the instructions are useful to both users. To start, you will need to create a new connection or server. If you click into the Server Dialog section of the instructions, youll see this:
Youll want to enter a name for this connection, I typically like the format:
So in this case I would use . Next, keep following the instructions and click the Connection tab . Here you will want to add as the host name, your username for username, and I prefer to never click the save password dialog. Typically passwords saved in this manner are not secure.
Then click Save. Opening up the new connection youve added, you should be able to see this. Note that what were interested in are the tables weve added, which by default, will show up several levels down:
Maps (finally!) of your table data
If you open a new query tab in PgAdmin ( by clicking Tools > Query Tool), and enter and run (click the lightning bolt button or press F5) the following query:
and then look in the Data Output pane, and horizontally scroll to last column named geom, and click the eye icon in the column header (which activates the spatial viewer), you should be able to see your states table with all the US states! (If everything was done correctly)
We can do the same for points:
Or even combine the states and points
Spatial SQL (finally!)
But were here to do spatial SQL! Now we can run the point in polygon query, finding out which states have the most wind turbines. You can use this query:
Which yields the following results:
State | Turbine Count
TX
CA
IA
OK
KS
For visualizing the results as a choropleth, we can use QGIS, which gives us this:
Use QGIS to visualize PostGIS data (and make the map above)
Getting started with QGIS is another blog post, but as far as working with Postgres is concerned, the setup is very similar to any DB admin tool. In fact, QGIS has a DB Manager tool for working with databases. However, lamentably, the one thing you cant do from that particular window is actually add a DB connection 🙁
So, to get started and add a DB connection (at least as of QGIS ), the easy way is to use the Browser tab, which makes sense, since this is the panel where you can explore sources of data to add to your project. To add a connection, right click on the PostGIS entry:
That will open the window for adding a new PostGIS connection:
This should look familiar after having to add a connection in PGAdmin, and youll need the same information. I also tend to check the Also list tables with no geometry option, since sometimes I need to look at or join information from tables that dont store geometry data. The next step is to open the DB Manager, which you can do from the main menu:
If you open up the PostGIS item in the Providers list, you will see the same table info that you could see in PGAdmin. You can visualize the entire table by right clicking the table and selecting Add to Canvas. But the real fun is loading the results of a custom query. Click the Sql Window button, which is the table & wrench button (second from the left). To make sure that the new SQL window is opened with a connection to the correct database, make sure you database or one of the entities within your database is highlighted in the Providers panel.
Now you can enter the query we used earlier to find the states with the most wind turbines, with a few modifications to ensure we can actually visualize the results.
The extra outer query takes the records we got for the counts of wind turbines in each state, and joins it back with the state layer in order to also include the geom column in the results of our query. This is the column that actually stores the geometry associated with each record. It also uses a right join so that we get all the states in the final query (since we still want to see the states that have no wind turbines). For these states, we also want to return a 0 rather than a null (so we can visualize it), so we use the Coalesce operator to either return the turbine_cnt or a 0 (only when that columns value is null).
To load the results of that query, first click Execute button to see/verify the tabular results, then click the Load as new layer checkbox to show the new layer dialog, enter a layer name, and then click the Load button.
Now you should see your state layer. Right click on the layer, in the Layers panel, select Properties, and then from the Symbology section you can apply the Graduated classification. To see more about this process, check out the QGIS manual :
rushbrookrathbone.co.uk?highlight=graduated
Wrap Up
Phew, that was a lot of work just to run one query, but now that youve got the basics setup, you can unleash all of the spatial SQL power on your big datasets.
Lastly, Ill put in a plug for DBeaver (the successor to SQuirreL-SQL), a multi-platform database IDE. Given that it can be used to administer just about any database (Postgres, MYSQL, Oracle, etc.), its a great tool to have in your toolbox. Its also open source, and has 11k+ stars on GitHub. It can be downloaded from here:
Download DBeaver Community
I use it for my SQL development, and it has a spatial data viewer for doing quick visualizations:
-
-
-