Dataset Manipulation¶
Datasets are folders holding samples to be analyzed. They aim to organize input executables according to a strict structure, including packing labels, original names, and so forth.
Structure¶
A (normal) dataset folder holds the following files/folders:
[name]
+-- files
| +-- {executables, renamed to their SHA256 hashes}
+-- data.csv # metadata and labels of the executable
+-- metadata.json # simple statistics about the dataset
A fileless dataset has the following structure:
[name]
+-- data.csv # metadata and labels of the executable
+-- features.json # dictionary of selected features and their descriptions
+-- metadata.json # simple statistics about the dataset
Fileless datasets can be used when we are sure of the features set to be computed. It can also save substantial disk space (such a dataset will at most take a few megabytes of data while the executables from the original one could occupy gigabytes).
Tool¶
A dedicated tool called dataset
is provided with the Packing Box to manipulate datasets. Its help message tells everything the user needs to get started.
$ dataset --help
[...]
This tool aims to manipulate a dataset in many ways including its creation, enlargement, update, alteration, selection,
merging, export or purge.
[...]
CMD command to be executed
[create/modify/delete]
alter alter the target dataset with a set of transformations
convert convert the target dataset to a fileless one
edit edit the data file
export export packed samples from a dataset or export the dataset to a given format
fix fix a corrupted dataset
ingest ingest samples from a folder into new dataset(s)
make add n randomly chosen executables from input sources to the dataset
merge merge two datasets
purge purge a dataset
remove remove executables from a dataset
rename rename a dataset
revert revert a dataset to its previous state
select select a subset of the dataset
update update a dataset with new executables
[read]
browse compute features and browse the resulting data
list list all the datasets from the workspace
plot plot something about the dataset
preprocess preprocess the input dataset given preprocessors
show get an overview of the dataset
view view executables filtered from a dataset
[...]
Visualization¶
The list of datasets in a folder can be viewed by using the command dataset list [folder]
(dataset list
will display all the datasets in the current folder).
Among the other commands, we can show
the metadata of the dataset for describing it, limiting the number of displayed records if required (-l
/--limit
; default is 10 records) and/or sorting the metadata per category (-c
/--categories
).
$ dataset show test-upx
Dataset characteristics
• #Executables: 10
• Format(s): PE32
• Packer(s): upx
• Size: 6MB
• Labelled: 100.00%
• With files: yes
Executables per size
Size Not
Range Packed % Packed %
─────────────────────────────────────────────
<20kB 0 0.00 0 0.00
20-50kB 0 0.00 0 0.00
50-100kB 2 33.33 1 25.00
100-500kB 1 16.67 2 50.00
500kB-1MB 2 33.33 1 25.00
>1MB 1 16.67 0 0.00
Total 6 60.00 4 40.00
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Sources:
[0] ~/.wine64/drive_c/windows [1] ~/.wine32/drive_c/windows
Executables' metadata and labels
Hash Path Creation Modification Label
─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
1d8bf746ba84e321d1afd29c48a6f152e3195cb0c92d5559d75a455d58… /home/user/.wine32/drive_c/windows/system32/tzres.dll 01/10/23 01/10/23 -
21383926f0b5b3909f03f61f1df8b14c7d8f691136e5000c1612016638… /home/user/.wine32/drive_c/windows/system32/strmdll.dll 01/10/23 01/10/23 -
3d2b2fd0d6f3fcb3a0107caa5679ace01f9d9cdb3c1ed37de66a8eb496… /home/user/.wine32/drive_c/windows/system32/d3dx9_33.dll 01/10/23 01/10/23 -
694b7a2075a9ec584346af7db65aacf55afce314e70753d0c88b5bb59d… /home/user/.wine32/drive_c/windows/system32/itircl.dll 01/10/23 01/10/23 -
695f311494155ba4890a2543d4398f97782f9a542b9ebf0978ff356338… /home/user/.wine32/drive_c/windows/system32/concrt140.dll 01/10/23 01/10/23 -
afa6815e81bd1da93d05465b21698da5d8decd177ac39d8fa2f724bbe4… /home/user/.wine32/drive_c/windows/system32/dispex.dll 01/10/23 01/10/23 upx
b73e7f86951d8f8a4b881cb69f6dbda28950b6015c558949f7a2d97781… /home/user/.wine32/drive_c/windows/system32/start.exe 01/10/23 01/10/23 upx
c6e875d0be29bfefc9a5a517e108b395f1404cdb9e80cb5c1c3604f457… /home/user/.wine32/drive_c/windows/system32/d3dcompiler_41… 01/10/23 01/10/23 -
e88f4aedd45410f2a44b94ff928529d9760bd1d35c09a818aa30395790… /home/user/.wine32/drive_c/windows/system32/traffic.dll 01/10/23 01/10/23 upx
ecd981b59b54c3e701079e478e908dd6b68f6ee8e8f7d319ba698c10a1… /home/user/.wine32/drive_c/windows/system32/msvcm80.dll 01/10/23 01/10/23 upx
While the show
command displays information about a dataset, it is not aimed to filter and observe records according to criteria. This can be done by using the view
command which uses Pandas' query method through the --query
option.
The following commands are aimed to parse a dataset:
browse
: this opens dataset's content with Visidata, showing data including computed features ; checkout thefeatures.yml
configuration file configuration file for examples of feature definitions.export
: this allows either to export packed samples only or to export the dataset to another file format (seedsff
package for the available formats).list
: this lists all the existing datasets, showing minimal information including the number of executables, the size, whether files are contained, in-scope executable formats and the numbers of packed samples per packer used.plot
: this allows to make plots representing characteristics of the dataset (e.g. the distribution of labels or information gains of the computed features).preprocess
: this preprocesses features data using an input preprocessor and then opens the result with Visidata, allowing to inspect preprocessed data that is fed to learning models.show
: this displays dataset information (as illustrated here above).view
: this allows to view samples in the terminal filter out with the input filter if any.
Manipulation¶
The rest of the other commands are aimed to manipulate datasets in different ways. The most trivial commands are:
purge
: this trivial command purges the target dataset of all its records and files.rename
: this other trivial command renames the target dataset ; it differs from a simple 'mv [...]
' in that it also adapts backup copies of the dataset.
For adding/removing executables to/from the dataset, a few commands are available:
ingest
: this allows to ingest samples from an external folder, specifying an input JSON of labels (-l
/--labels
) or enabling (super)detection (-d
/--detect
-- that is, using a combination of detectors with voting for determining the final label).make
: this adds a given number (-n
/--number
) of new executables to the dataset, optionally only from some given executable formats (-f
/--formats
), randomly packing some of them such that the dataset is balanced (using the-b
/--balance
option will balance between packer labels, not simply between packed and not packed) and with every applicable packer or only the given ones (-p
/--packer
).merge
: this merges the second input dataset into the first one.remove
: this allows to remove executables from the dataset according to criteria relying on Pandas' query method through the--query
option.select
: this creates a new dataset with a selected subset of the target dataset, using criteria relying on Pandas' query method through the--query
option.update
: this updates the current dataset with executables from the target folders (-s
/--source-dir
), either packed or not, labelling them with an input JSON (-l
/--labels
) or enabling (super)detection (-d
/--detect
).
It is also possible to apply corrective/evolutive measures on a dataset:
alter
: this will apply the selected alterations in order to modify samples from the target dataset ; checkout thealterations.yml
configuration file for examples of alteration definitions.convert
: this makes possible to convert a dataset with files (which offers the possibility to select features multiple times from the same dataset) to a dataset without files (that is, with its features computed once and for all, not keeping the related files and then possibly saving a large amount of disk space) ; checkout thefeatures.yml
configuration file configuration file for examples of feature definitions.edit
: this allows to edit dataset's data contained in itsdata.csv
with Visidata.fix
: this allows to fix a corrupted dataset, especially when JSON files do not match the content of thefiles
folder anymore, which can be fixed either by relying on the content of this folder (with the-f
/--files
option) or on hashes contained in the labels JSON, eventually applying detection (-d
/--detect
) to fix labels.revert
: this allows to revert the dataset to its state before the previous operation, with a maximum of 3