Computer vision capabilities have been advancing rapidly and are expected to become an important component of incident and disaster response. However, the majority of computer vision capabilities are not meeting public safety’s needs, such as support for search and rescue, due to the lack of appropriate training data and requirements. For example, in 2019 a leading computer vision benchmark mislabeled a flooded region as a “toilet” and a highway surrounded by flooding as a “runway.” In response, we’ve developed a dataset of images collected by the Civil Air Patrol of various natural disasters. Two key distinctions are the low-altitude, oblique perspective of the imagery and the disaster-related features, both of which are rarely featured in computer vision benchmarks and datasets. This task invites researchers to work in this new domain, develop new capabilities, and close the performance gap; in essence, systems must label video clips with the correct disaster-related feature(s). The pilot for the DSDI task was successfully conducted in TRECVID 2020, and we will continue the task this year.
We will also make the video testing dataset from 2020 and its annotation available to teams to train their systems.
The initial release of the LADI dataset was limited to the following set of criteria. The crowdsourced human annotations were prioritized based on these criteria.
|Administrative Boundaries||alabama, florida, georgia, louisiana, mississippi, north carolina, puerto rico, south carolina, texas, virgin islands, virginia|
|Months||March - November|
|Years||2015 - 2019|
|Altitude (AGL ft)||<= 1000|
|Image size (MB)||>= 4|
Since 2015, each of these locations has had a FEMA major disaster declaration for a hurricane or flooding. Three of these (Louisiana, South Carolina, and Texas) also had a major declaration for flooding during the Atlantic hurricane season months of June to November. The scope also included four of the five locations with the most images collected from 2015 to 2018. The fifth location, California with its significant wildfire activity, is not represented in the scope.
The lower-altitude criterion is intended to further distinguish the LADI dataset from satellite or "top down" datasets and to support development of computer vision capabilities for small drones operating at low altitudes. A minimum image size was selected to maximize the efficiency of the crowdsourced workers, since lower-resolution images are harder to annotate. The 4 MB threshold was slightly less than the annual average file size.
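The selection criteria in the table above can be read as a simple filter over image records. The following is a minimal sketch under stated assumptions: the `ImageRecord` class and its field names are hypothetical and are not part of the LADI release; only the thresholds come from the table.

```python
from dataclasses import dataclass

# Thresholds taken from the criteria table; everything else is illustrative.
STATES = {
    "alabama", "florida", "georgia", "louisiana", "mississippi",
    "north carolina", "puerto rico", "south carolina", "texas",
    "virgin islands", "virginia",
}
MIN_MONTH, MAX_MONTH = 3, 11    # March - November
MIN_YEAR, MAX_YEAR = 2015, 2019
MAX_ALTITUDE_FT = 1000          # altitude above ground level, in feet
MIN_SIZE_MB = 4


@dataclass
class ImageRecord:
    """Hypothetical record; field names are assumptions for this sketch."""
    state: str
    year: int
    month: int
    altitude_agl_ft: float
    size_mb: float


def in_initial_scope(rec: ImageRecord) -> bool:
    """Return True when a record satisfies every criterion in the table."""
    return (
        rec.state.lower() in STATES
        and MIN_YEAR <= rec.year <= MAX_YEAR
        and MIN_MONTH <= rec.month <= MAX_MONTH
        and rec.altitude_agl_ft <= MAX_ALTITUDE_FT
        and rec.size_mb >= MIN_SIZE_MB
    )
```

For instance, a 5.2 MB image taken over Texas in August 2017 at 800 ft would pass this filter, while any image taken over California would not.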
For more information about LADI, please refer to the GitHub organization.
A testing dataset of about 5 hours of video will be distributed to participants according to the published schedule. The testing data will be chosen from operational videos of a recent natural disaster event and will be similar to the testing dataset used in last year's task. The testing data will be segmented into short video clips of at most about 60 seconds each. A master shot reference table that maps the start and end of each video clip within the original raw video will be provided for participants to use in their run submissions.
The LADI dataset also includes machine-generated labels from commercial and open-source image recognition tools to provide additional context. MIT LL ran classifiers trained on ImageNet and Places365, as well as various commercial classifiers. Each image is tagged with the top 10 labels from each classifier. The labels from the open-source image recognition classifiers were generated using the Lincoln Laboratory Supercomputing Center, and the commercial classifiers were run on the respective commercial vendor’s platform. In particular, we used the pretrained implementation of Inception-ResNetV2 trained on the ImageNet dataset in Keras, and the pretrained implementation of ResNet50 trained on Places365-Standard in PyTorch. We provide these annotations following practices associated with the YouTube-8M dataset. Here are a couple of examples:
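To illustrate how a shot reference table relates clips to the raw video, here is a minimal sketch. It assumes fixed-length cutting and a `(clip_number, start_sec, end_sec)` row layout; both are assumptions for illustration only, since the actual segmentation method and the format of the distributed table are not described here.

```python
def segment_into_clips(duration_sec, max_clip_sec=60.0):
    """Return (clip_number, start_sec, end_sec) rows covering the whole video.

    Fixed-length cutting is an assumption of this sketch, not the task's method.
    """
    if duration_sec <= 0 or max_clip_sec <= 0:
        raise ValueError("durations must be positive")
    rows = []
    start = 0.0
    clip_number = 1
    while start < duration_sec:
        end = min(start + max_clip_sec, duration_sec)
        rows.append((clip_number, start, end))
        start = end
        clip_number += 1
    return rows


def locate_in_raw_video(rows, clip_number, offset_sec):
    """Map an offset inside a clip back to a timestamp in the raw video."""
    for number, start, end in rows:
        if number == clip_number:
            if not 0 <= offset_sec <= end - start:
                raise ValueError("offset falls outside the clip")
            return start + offset_sec
    raise KeyError(clip_number)
```

With this layout, a 150-second raw video yields three clips, and an offset of 10 seconds into the second clip corresponds to 70 seconds in the raw video.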
|ANNOTATION_MACHINE||Places365||airfield||softmax weight for label class "airfield"|
|ANNOTATION_MACHINE||Imagenet||tench, Tinca tinca||softmax weight for label class "tench, Tinca tinca"|
As part of the dataset, we extract and process the metadata and Exif information from each image. This includes information such as date and time, latitude and longitude coordinates, and camera settings. The specific Exif data available varies across the images. Here are a few examples:
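As a rough illustration of how per-class softmax weights and a top 10 list could be derived from a classifier's raw scores, consider the sketch below. The function names and the input scores are hypothetical; this is not the pipeline that produced the LADI annotations.

```python
import math


def softmax(logits):
    """Convert raw classifier scores into weights that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_labels(logits, class_names, k=10):
    """Return the k (label, softmax weight) pairs with the largest weights."""
    weights = softmax(logits)
    ranked = sorted(zip(class_names, weights), key=lambda p: p[1], reverse=True)
    return ranked[:k]
```

Each returned pair corresponds to one row of the kind shown above: a label class together with its softmax weight.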
|METADATA||File||fieldpath||path to file location in filesystem|
|METADATA||File||HDF5||location of file in HDF5|
|METADATA||File||filesize||size in bytes|
|METADATA||EXIF||ImageHeight||height of image in pixels|
|METADATA||EXIF||ImageWidth||width of image in pixels|
|METADATA||EXIF||GPSLatitude||latitude of image from GPS|
|METADATA||EXIF||GPSLongitude||longitude of image from GPS|
Specifically, from 2015 to 2019 we observed more than ten camera models, and an overwhelming majority of images were collected at low altitudes of 609 meters (2000 feet) and below. Additionally, the largest image size exceeded 20 MB, with an average file size of at least 5 MB. The average file size was consistent over the years.
Here are examples of unannotated imagery collected by CAP (Civil Air Patrol) and hosted by FEMA. Note that the lighting, orientation, perspective, and resolution vary across the examples. These variations are a key component of the LADI dataset, as they are simply the reality of operational imagery. Any technology to support disaster response will need to handle these variations.
As proposed by Heitz and Koller, a dataset can include things or stuff. A thing is something that can be easily and discretely labeled, whereas stuff is less discrete and may have no clear boundaries. For example, as of 2019, other datasets such as xView and COCO consist only of things, such as book or yacht. Because of our public safety-focused objectives, we include both thing and stuff labels. For example, a building is a thing, but there can be an additional label of damage, which is stuff.
We defined a reasonable label set that is feasible for crowdsourcing while meeting public safety's needs. The dataset currently employs a hierarchical labeling scheme of five coarse categories, with more specific annotations for each category. The five coarse categories are damage, environment, infrastructure, vehicles, and water.
For each of the coarse categories, there are 4-9 more specific annotations:
|flooding / water damage||grass||building||boat||lake / pond|
|landslide||lava||dam / levee||car||ocean|
|rubble / debris||sand||utility or power lines / electric towers||river / stream|
|smoke / fire||shrubs||railway|
|snow / ice||wireless / radio communication towers|
Each team may submit a maximum of 4 prioritized runs per training type. The submission formats are described below.
Please note: Only submissions which are valid when checked against the supplied DTDs will be accepted. You must check your submission before submitting it. NIST reserves the right to reject any submission which does not parse correctly against the provided DTD(s). Various checkers exist, e.g., Xerces-J: java sax.Counter -v YourSubmision.xml.
|Speed||Clock time per inference, reported in seconds (to one decimal place), must be provided by participants for each run.|
|Mean Average Precision||Average precision will be calculated for each feature, and the mean average precision reported for each submission.|
|Recall||True positive, true negative, false positive, and false negative rates.|
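The mean average precision metric can be sketched as follows. This uses the common textbook definition of average precision over a ranked list; the official scoring tool may use a different variant, and the clip identifiers and dictionary layout below are hypothetical.

```python
def average_precision(ranked_clips, relevant_clips):
    """Average precision for one feature.

    ranked_clips: clip ids ordered from most to least confident, no duplicates.
    relevant_clips: clip ids that truly contain the feature.
    """
    relevant = set(relevant_clips)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, clip in enumerate(ranked_clips, start=1):
        if clip in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / len(relevant)


def mean_average_precision(run, ground_truth):
    """Mean of the per-feature average precision over all judged features.

    run: feature -> ranked clip ids; ground_truth: feature -> relevant clip ids.
    """
    scores = [
        average_precision(run.get(feature, []), relevant)
        for feature, relevant in ground_truth.items()
    ]
    return sum(scores) / len(scores) if scores else 0.0
```

A feature that is missing from a run contributes an average precision of zero in this sketch, so omitting a feature lowers the mean.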