
Loading and checking data
Johannes Krietsch
Source: vignettes/load_and_check_data.Rmd
Good to know
tools4watlas is based on data.table to be fast and efficient. A key feature of data.table is modification in place, where data is changed without making a copy. To prevent this (whenever it is not desired), use the function copy() to make a true copy of the data set. Basic knowledge of data.table is helpful, but not necessary, when working with tools4watlas.
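As a minimal sketch of what modification in place means (the object names and the added column below are made up for illustration): modifying a data.table by reference also changes any other variable pointing to the same object, unless copy() was used.
# illustration of modification in place (object names are made up)
library(data.table)
dt <- data.table(tag = c("3027", "3038"))
dt_reference <- dt        # points to the same object
dt_copy <- copy(dt)       # true copy
dt[, n_positions := 1L]   # add a column by reference
names(dt_reference)       # also contains "n_positions"
names(dt_copy)            # unchanged: only "tag"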
Getting data
WATLAS data can either be loaded from a local SQLite database or a remote SQL database server. To do so, first select the tags and time period for which to extract data.
Use the tags_watlas_all.xlsx file (including metadata of all tags) or, for collaborators, the tags_watlas_subset.xlsx file (including a subset of the metadata) to select the desired tags. Here shown with the example data in tools4watlas.
Loading the tags_watlas_subset.xlsx will provide a table with the following columns:
Column | Description |
---|---|
season | Year in which the bird was caught |
species | Species common name |
tag | Tag ID with 4 digits |
rings | Metal ring number |
crc | Colour ring combo |
release_ts | Release time stamp in CET |
catch_location | Location where the bird was caught |
Select the desired tags and time period
# file path to the metadata
fp <- system.file(
  "extdata", "tags_watlas_subset.xlsx", package = "tools4watlas"
)

# load meta data
all_tags <- readxl::read_excel(fp, sheet = "tags_watlas_all") |>
  data.table()

# subset desired tags using data.table
# (for example all tags from season 2023)
tags <- all_tags[season == 2023]$tag

# time period for which data should be extracted from the database (in CET)
from <- "2023-09-21 00:00:00"
to <- "2023-09-25 00:00:00"
Extract data from local SQLite file
First, the path and file name of the local SQLite database need to be provided. Then, with the established connection, the database can be queried for the selected tags and period. Here we will load the tagging data in one data.table.
# database connection
sqlite_db <- system.file(
  "extdata", "watlas_example.SQLite", package = "tools4watlas"
)
con <- RSQLite::dbConnect(RSQLite::SQLite(), sqlite_db)

# load data from database
data <- atl_get_data(
  tags,
  tracking_time_start = from,
  tracking_time_end = to,
  timezone = "CET",
  use_connection = con
)

# close connection
RSQLite::dbDisconnect(con)
Alternatively, extract from remote SQL-database
To safely work with database credentials, one option is to store them as environmental variables in R. This allows, for example, keeping scripts on GitHub without sharing the credentials. Restart R after adding variables to be able to access them. Ask Allert for the host, username and password and then add them to your environment like this:
# open .Renviron to edit
file.edit("~/.Renviron")
# add variables (these lines go into the .Renviron file)
host = "host"
username = "username"
password = "password"
# access variables (example)
Sys.getenv("variable_name")
atl_get_data works the same way when connecting to a remote database as when connecting to a local SQLite database. In this example (chunk not run and only shown) we load the last three days of data from all tags of season 2024. Host, username and password are specified as environmental variables in this example, but can also be specified directly.
Since we want the up-to-date metadata, we load the tags_watlas_all.xlsx from the “WATLAS” SharePoint folder: Documents/data/. Either specify the path to your local copy of this folder or add the path for your user in the atl_file_path() function.
# file path to WATLAS teams data folder
fp <- atl_file_path("watlas_teams")

# load meta data
all_tags <- readxl::read_excel(
  paste0(fp, "tags/tags_watlas_all.xlsx"),
  sheet = "tags_watlas_all"
) |>
  data.table()

# subset all tags from 2024
tags <- all_tags[season == 2024]$tag

# select N last days to get data from
days <- 3
from <- (Sys.time() - 86400 * days) |> as.character()
to <- (Sys.time() + 3600) |> as.character()

# load data from database
data <- atl_get_data(
  tags,
  tracking_time_start = from,
  tracking_time_end = to,
  timezone = "CET",
  host = Sys.getenv("host"),
  database = "atlas2024",
  username = Sys.getenv("username"),
  password = Sys.getenv("password")
)
Data explanation
The resulting loaded WATLAS data will be a data.table with the following columns:
posID | tag | time | datetime | x | y | nbs | varx | vary | covxy |
---|---|---|---|---|---|---|---|---|---|
1 | 3027 | 1695438802 | 2023-09-23 03:13:22 | 650692.8 | 5902549 | 3 | 46.97 | 392.42 | 126.62 |
2 | 3027 | 1695438805 | 2023-09-23 03:13:25 | 650705.6 | 5902576 | 3 | 49.09 | 460.84 | 141.21 |
3 | 3027 | 1695438808 | 2023-09-23 03:13:28 | 650691.6 | 5902536 | 3 | 58.18 | 471.81 | 155.89 |
4 | 3027 | 1695439189 | 2023-09-23 03:19:49 | 650728.6 | 5902571 | 3 | 49.97 | 441.46 | 138.20 |
5 | 3027 | 1695439192 | 2023-09-23 03:19:52 | 650721.0 | 5902556 | 3 | 5.34 | 28.16 | 8.58 |
6 | 3027 | 1695439195 | 2023-09-23 03:19:55 | 650721.1 | 5902559 | 3 | 5.55 | 35.03 | 10.43 |
Column | Description |
---|---|
posID | Unique number for localizations |
tag | 4 digit tag number (character), i.e. last 4 digits of the full tag number |
time | UNIX time (seconds) |
datetime | Datetime in POSIXct (UTC) |
x | X-coordinates in meters (UTM 31 N) |
y | Y-coordinates in meters (UTM 31 N) |
nbs | Number of Base Stations used in calculating coordinates |
varx | Variance in estimating X-coordinates |
vary | Variance in estimating Y-coordinates |
covxy | Co-variance between X- and Y-coordinates |
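The time and datetime columns carry the same information: time is UNIX time in seconds and datetime the corresponding POSIXct timestamp in UTC. A quick consistency check could look like this (an illustrative sketch, not part of the standard workflow):
# check that UNIX time and datetime match (illustrative sketch)
data[, all(datetime == as.POSIXct(time, origin = "1970-01-01", tz = "UTC"))]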
Add species column (or other relevant columns)
If we are working with multiple species, we can join the species from the metadata. In this case I want species to be the first column of the table.
# join with data
all_tags[, tag := as.character(tag)]
data[all_tags, on = "tag", `:=`(species = i.species)]
# make species first column
setcolorder(data, c("species", setdiff(names(data), c("species"))))
We can also add other metadata by merging whatever columns we want (e.g. color rings and catch location). However, when working with large data sets, it is advised to only add columns that are necessary; this can also be done whenever needed. As this is just an example, we delete the columns again.
# join with metal rings, color rings and catch location
all_tags[, tag := as.character(tag)]
data[all_tags, on = "tag", `:=`(
  rings = i.rings,
  crc = i.crc,
  catch_location = i.catch_location
)]

# delete columns
data[, c("rings", "crc", "catch_location") := NULL]
Save data
At this point it might be good to save the raw data, as extracting the data from the database can take a long time with big datasets. A convenient and fast way is to use fwrite from the data.table package. By including yaml = TRUE we make sure the data stays in the same format when we load it again. Change the file path when running this example.
# save data
fwrite(data, file = "../inst/extdata/watlas_data_raw.csv", yaml = TRUE)
Check data
Data summary
Here we simply check for how many individuals we have data, and how many positions we have by tag and date.
# load data
data <- fread("../inst/extdata/watlas_data_raw.csv", yaml = TRUE)
# data summary
data_summary <- atl_summary(data, id_columns = c("species", "tag"))
# N individuals with tagging data
data_summary |> nrow()
## [1] 8
# N by species
data_summary[, .N, by = species]
## species N
## <char> <int>
## 1: redshank 1
## 2: red knot 1
## 3: bar-tailed godwit 1
## 4: curlew 1
## 5: oystercatcher 1
## 6: turnstone 1
## 7: dunlin 1
## 8: sanderling 1
# show head of the table
data_summary |> knitr::kable(digits = 2)
species | tag | n_positions | first_data | last_data | days_data | min_gap | max_gap | max_gap_f | coverage |
---|---|---|---|---|---|---|---|---|---|
redshank | 3027 | 15859 | 2023-09-23 03:13:22 | 2023-09-23 22:24:26 | 0.8 | 3 | 2523 | 42 min | 0.69 |
red knot | 3038 | 15959 | 2023-09-23 00:00:01 | 2023-09-23 23:59:57 | 1.0 | 3 | 2850 | 47.5 min | 0.55 |
bar-tailed godwit | 3063 | 12614 | 2023-09-23 03:27:49 | 2023-09-23 22:24:55 | 0.8 | 3 | 2541 | 42.4 min | 0.55 |
curlew | 3100 | 8423 | 2023-09-23 04:21:46 | 2023-09-23 21:41:16 | 0.7 | 3 | 16145 | 4.5 hours | 0.41 |
oystercatcher | 3158 | 12533 | 2023-09-23 00:00:01 | 2023-09-23 23:59:57 | 1.0 | 3 | 3060 | 51 min | 0.44 |
turnstone | 3188 | 10195 | 2023-09-23 00:00:45 | 2023-09-23 23:41:50 | 1.0 | 3 | 5742 | 1.6 hours | 0.36 |
dunlin | 3212 | 3856 | 2023-09-23 00:00:00 | 2023-09-23 23:59:56 | 1.0 | 8 | 4784 | 1.3 hours | 0.36 |
sanderling | 3288 | 8131 | 2023-09-23 00:00:03 | 2023-09-23 23:59:54 | 1.0 | 6 | 2622 | 43.7 min | 0.56 |
Column | Description |
---|---|
species | Species |
tag | Tag number |
n_positions | Number of positions |
first_data | Datetime first localization (UTC) |
last_data | Datetime last localization (UTC) |
days_data | Days of data |
min_gap | Minimum time interval between localizations (interval of the tag in seconds) |
max_gap | Maximum time interval between localizations (largest gap in seconds) |
max_gap_f | Maximum time interval between localizations (in seconds, minutes, hours, or days) |
coverage | Coverage (=1 if every min_gap has a localization between first_data and last_data) |
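As a rough way to think about this coverage value (a sketch only, not necessarily the exact computation done by atl_summary()): with a tag interval of min_gap seconds, the number of positions received can be compared with the number expected between first_data and last_data.
# sketch of a coverage-like value (not the atl_summary() internals;
# assumes first_data and last_data are POSIXct)
data_summary[, coverage_sketch := n_positions /
  (as.numeric(difftime(last_data, first_data, units = "secs")) / min_gap)]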
Plot the number of positions by day.
# add date
data[, date := as.Date(datetime)] |> invisible()
# N positions by tag and day
data_subset <- data[, .N, by = .(tag, date)]

# plot data
ggplot(data_subset, aes(x = date, y = tag, fill = N)) +
  geom_tile() +
  scale_fill_viridis(
    option = "A", discrete = FALSE, trans = "log10", name = "N positions",
    breaks = trans_breaks("log10", function(x) 10^x),
    labels = trans_format("log10", math_format(10^.x)),
    direction = -1
  ) +
  labs(x = "Date", y = "Tag") +
  theme_classic()

Number of positions per day by tag
Plot overview of the data
Now we want to see the data on a map. With large datasets it is convenient to plot heatmaps, as plotting, for example, 40 million points would result in a lot of overplotting and make the plotting slow. If the dataset is small, one can of course also plot the data by tag number or in other ways (see the vignette on plotting data, and the sketch at the end of this section).
# create basemap
bm <- atl_create_bm(data, buffer = 800)

# round data to 1 ha (100x100 meter) grid cells
data[, c("x_round", "y_round") := list(
  plyr::round_any(x, 100),
  plyr::round_any(y, 100)
)]

# N by location
data_subset <- data[, .N, by = c("x_round", "y_round")]

# plot heatmap
bm +
  geom_tile(
    data = data_subset, aes(x_round, y_round, fill = N),
    linewidth = 0.1, show.legend = TRUE
  ) +
  scale_fill_viridis(
    option = "A", discrete = FALSE, trans = "log10", name = "N positions",
    breaks = trans_breaks("log10", function(x) 10^x),
    labels = trans_format("log10", math_format(10^.x)),
    direction = -1
  )

Heatmap of all positions
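For a small dataset, the positions can also be plotted as points by tag on the same basemap, instead of as a heatmap (a minimal sketch; see the vignette on plotting data for more options):
# simple point plot coloured by tag (sketch for small datasets)
bm +
  geom_point(
    data = data, aes(x, y, colour = tag),
    size = 0.5, alpha = 0.3, show.legend = TRUE
  ) +
  labs(colour = "Tag")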