Note: Page numbers referring to figures and tables are followed by an italicized f or t respectively.
activation functions
defined, 178
add_edge function, 41
add_question function, 112
add arithmetic instruction, 15
ADS (Alternate Data Streams), 29
Advanced Persistent Threat 1 attacker group. See APT1 attacker group
advanced persistent threats (APTs), 60
Allaple.A malware family, 157, 157f
Alternate Data Streams (ADS), 29
anti-disassembly techniques, 22
apply_hashing_trick function, 138
APT1 (Advanced Persistent Threat 1) attacker group, 37–39, 38f, 45–47, 45f–47f, 61, 61f, 76, 76f, 86, 222–223
APTs (advanced persistent threats), 60
ArchSMS family of Trojans, 55
area under the curve (AUC), 209–210, 210f, 213
arithmetic instructions, 15, 15t
.asarray method, 142
assembly language, defined, 12. See also x86 assembly language
AT&T, 43
AT&T syntax, 13
attributes, 37
adding to nodes and edges, 42
AUC (area under the curve), 209–210, 210f, 213
autoencoder neural networks, 194–195, 195f
automatic feature generation, 188
backpropagation, 190–192, 190f–191f
bag of features model, 62–64, 63f
features, defined, 62
Jaccard index and, 65
bar charts (histograms), 168–170, 168f–169f
base virtual memory address, 6
bias parameter, 104
bipartite networks, 37–39, 38f
bitcoin mining, 158, 160–161, 168f, 172f–173f, 173
callbacks
built-in (Keras package), 212
creating shared callback relationship network, 51–54
capstone module, 20
Carrera, Ero, 5
cmp instruction, 18
CNNs (convolutional neural networks), 193–194, 194f
coarsenings, 46
color attribute, 49
comment_sample function, 82–84
COMMENT mode, 229
compile method, 202
compressed_data_weight parameter, 103
compressed_data parameter, 103–104
conditional branches, defined, 15
control flow, 17
convolutional neural networks (CNNs), 193–194, 194f
general-purpose registers, 13–14
stack and control flow registers, 14–15
cross_validation module, 151
cross-validation, 150–153, 151f, 153f
CuckooBox software platform, 27, 33–34, 59
“curse of dimensionality,” 92
cv_evaluate function, 151
dapato malware family, 62, 67f–68f, 70f–72f
data movement instructions, 15–20, 16t
control flow graphs, 19–20, 19f
control flow instructions, 17–18
applying to malware, v
.data section (in PE file format), 4
dateutil package, 164
dec arithmetic instruction, 15
decision boundaries, 93–98, 95f–98f
identifying with k-nearest neighbors, 97–98, 97f–98f
identifying with logistic regression, 96–97, 96f–97f
overfit machine-learning model, 100, 101f
underfit machine-learning model, 99, 99f
well-fit machine-learning model, 100, 100f
decision thresholds, 149
DecisionTreeClassifier class, 130
decision trees, 109–115, 109f–110f, 113f–114f
decision tree–based detectors, 129
importing modules, 129
initializing sample training data, 130
instantiating classes, 130
follow-up questions, 111
limiting depth or number of questions, 111–112
deep learning, 175–197, 216. See also neural networks
automatic feature generation, 188
building neural networks, 182–188
neurons, 176
training neural networks, 189–193
types of neural networks, 193–197
universal approximation theorem, 181–182
deep neural networks. See neural networks
describe method, 159
detection accuracy evaluation, 119–126, 146–153
base rates and precision, 124–126
effect of base rate on precision, 124–125
estimating precision in deployment environment, 125–126
with cross-validation, 150–153, 151f, 153f
neural networks, 209–211, 210f–211f
possible detection outcomes, 120, 120f
with ROC curves, 123–124, 123f, 147–150, 150f
true and false positive rates, 120–124
relationship between, 121–122, 121f–122f
directed graphs, 180
distance functions, 107
DLLs (dynamic-link libraries), 13
DOS header (in PE file format), 3
.dot format, 42
dynamically downloaded data, 22–23
dynamic analysis
bag of features model, 63
dataset for, 222
for disassembly, 26
for malware data science, 26
typical malware behaviors, 27
limitations, 33
dynamic API call–based similarity, 72, 72f
dynamic-link libraries (DLLs), 13
EAX register, 14
EBP register, 14
EBX register, 14
ECX register, 14
edges, 37
adding attributes, 42
adding to shared relationship networks, 41
adding visual attributes to, 48–51
EDX register, 14
EFLAGS register, 15
ELU activation function, 179t
epochs parameter, 206
ESP register, 14
euclidean_distance function, 107
Euclidean distance, 107
evaluate function, 148
evaluating malware detection systems. See detection accuracy evaluation
export_graphviz function, 132
extract_features function, 204–205
ExtractImages helper class, 56–57
fakepdfmalware.exe, 7
false negatives, defined, 120, 120f
base rates and precision, 124–126
false positive rate, 121
relationship between true and false positive rates, 121–122, 121f–122f
feature_extraction module, 129
feature extraction
Import Address Table features, 136
machine learning–based malware detectors, 90–92, 141–142
Portable Executable header features, 135–136
string features, 135
training neural networks with Keras package, 203–204
why all possible features can’t be used at once, 137–138
feature hashing. See hashing trick
feature spaces, 93–98, 94f–98f
feed-forward neural networks, 181, 181f, 193
fit_generator function, 204–206, 208, 212, 214
flags, defined, 15
format strings, 70
Gaussian activation function, 179t
generative adversarial networks (GANs), 195–196
generator parameter, 206
get_string_features function, 141–142, 144
get_strings function, 82
get_training_data function, 143
get_training_paths function, 143
GETMAIL utility, 223
-G flag, 44
converting extracted .ico files to .png graphics, 8
creating directory to hold extracted images, 7–8
extracting image resources using wrestool, 8
GraphViz, 76
decision tree–based detectors, 131–133, 132f
malware network analysis, 43–51
adding visual attributes to nodes and edges, 48–51
parameters, 44
similarity graphs, 76
ground_truth variable, 130
hashing trick (feature hashing), 138–141
hidden layer, 181
histograms (bar charts), 168–170, 168f–169f
hostname_projection argument, 225
IAT. See Import Address Table
icoutils toolkit, 5
IDA Pro, 12
.idata section (imports) (in PE file format), 4
Identity activation function, 178t
Import Address Table (IAT), 4
extracting features, 136
similarity analysis based on, 71, 71f
inc arithmetic instruction, 15
information gain, 113
instruction sequence–based similarity, 68f
Intel syntax, 13
Internet Relay Chat (IRC), 2
int function, 148
inverted indexing, 82
ircbot.exe bot, 2
jaccard_index_threshold argument, 227–228
jaccard function, 73
Jaccard index
building similarity graphs, 73–75
dynamic API call–based similarity, 72
instruction sequence–based similarity, 68
scaling similarity comparisons, 77
strings-based similarity, 70
jge instruction, 18
jmp instructions, 18
Kaspersky, 62
Keras package, building neural networks with, 199–214
compiling model, 202–203, 202f
defining architecture of model, 200–202
evaluating model, 209–211, 210f–211f
layers, 200
saving and loading model, 209
syntaxes, 200
training model, 203–209, 211–214
built-in callbacks, 212
custom callbacks, 213–214, 214f
data generators, 204–207, 207f
validation data, 207–209, 208f
keyloggers, 158, 168f, 172f–173f, 173
K-fold cross-validation, 151
k-nearest neighbors, 105–109, 106f, 108f
identifying decision boundaries with, 97–98, 97f–98f
logistic regression vs., 108–109
math behind, 107
pseudocode for, 107
when to use, 109
lea instruction, 16
Leaky ReLU activation function, 179t
learned_parameters parameter, 103
linear disassembly, 12
limitation of, 12
LOAD mode, 229
logistic_function function, 103–104, 104f
logistic_regression function, 103
logistic regression, 102–105, 103f–104f, 154
gradient descent, 105
identifying decision boundaries with, 96–97, 96f–97f
k-nearest neighbors vs., 108–109
limitation of, 102
plot of logistic function, 104f
pseudocode for, 103
when to use, 105
long short-term memory (LSTM) networks, 196
Los Alamos National Laboratory, 41
machine learning–based malware detectors, 89–117, 127–154
building basic detectors, 129
collecting training examples, 90–91
designing good features, 92
building real-world detectors, 141–146
running detector on new binaries, 144
dataset for, 224
decision boundaries, 93–98, 95f–98f
evaluating detector performance, 146
cross-validation, 150–153, 151f, 153f
splitting data into training and test sets, 148–149
Import Address Table features, 136
Portable Executable header features, 135–136
string features, 135
why all possible features can’t be used at once, 137–138
feature spaces, 93–98, 94f–98f
overfitting and underfitting, 98–99, 99f–101f
supervised vs. unsupervised algorithms, 93
terminology and concepts, 128–129
traditional algorithms vs., 90
types of algorithms, 101, 102f
decision trees, 109–115, 109f–110f, 113f–114f
k-nearest neighbors, 97–98, 97f–98f, 105–109, 106f, 108f
logistic regression, 96–97, 96f–97f, 102–105, 103f–104f
malware_projection argument, 52, 225–227
malware detection evaluation. See detection accuracy evaluation
malware network analysis, 35–58, 36f
attributes, defined, 37
bipartite networks, 37–39, 38f
creating shared callback relationship network, 51–54, 225–226, 226f
parsing command line arguments, 52
saving networks to disk, 54
creating shared image relationship networks, 54–58, 55f, 226–227
extracting graphical assets, 57
parsing initial argument and file-loading code, 55–57
saving networks to disk, 58
edges, defined, 37
GraphViz, creating visualizations with, 43–51
parameters, 44
NetworkX library, creating networks with, 40–43
adding attributes, 42
adding nodes and edges, 41
saving networks to disk, 42–43
nodes, defined, 37
projections, 38
shared code analysis and, 60–61
visualization challenges, 39–40
distortion problem, 39–40, 40f
force-directed algorithms, 40
malware samples, 61–62, 222–224
modified system objects, 30–32
limitations of, 33
MAPIGET utility, 223
Mastercard, iii
matplotlib library, 148–150, 162–167, 162f
plotting ransomware and worm detection rates, 165–167, 166f
plotting ransomware detection rates, 164–165, 165f
plotting relationship between malware size and detection, 162–163
max function, 160
memory cells, 196
minhash approach
combined with sketching, 79
minhash function, 82
ModelCheckpoint callback, 212
Model class, 201
mutexes, defined, 32
my_generator function, 205, 207–208
Nemucod.FG malware family, 157, 157f
NetworkX library
creating shared relationship networks, 41–42
overview, 41
saving networks to disk, 42–43
neural networks
automatic feature generation, 188
building
with four neurons, 186–188, 186f–187f, 187t
with three neurons, 184–186, 185f–186f, 185t
with two neurons, 182–184, 182f–184f, 183t–184t
building with Keras package, 199–214
compiling model, 202–203, 202f
defining architecture of model, 200–202
evaluating model, 209–211, 210f–211f
saving and loading model, 209
training model, 203–209, 211–214
dataset for, 224
neurons, 176
anatomy of, 177–180, 177f, 178t–180t
using backpropagation, 190–192, 190f–191f
using forward propagation, 189–190
vanishing gradient problem, 192–193
feed-forward, 193
generative adversarial, 195–196
recurrent, 196
universal approximation theorem, 181–182, 182f
neurons, 176
anatomy of, 177–180, 177f, 178t–180t
dynamic API call–based similarity, 72
instruction sequence–based similarity, 67–68
nodes, 37
adding attributes, 42
adding to shared relationship networks, 41
adding visual attributes to, 48–51
objective function, 189
optional header (in PE file format), 3–4
output_dot_file argument, 227–228
output_file argument, 52, 225, 227
overfit machine-learning models, 98–99, 101f
overlap parameter, 44
packing, 21
difficulty of disassembling packed malware, 26
legitimate uses of, 22
filtering data using conditions, 161
manipulating DataFrame, 159–161
Parkour, Mila, 61
pasta malware family, 62, 67f–68f, 70f–72f
PE. See Portable Executable file format
PE (Portable Executable) header, 3, 135–136
disassembly using, 20
opening and parsing files, 5–6
pulling information from PE fields, 6
pefile PE parsing module, 51–52
persistent malware similarity search systems, 79–87
building
allowing users to search for and comment on samples, 82–84
implementing database functionality, 80–81
importing packages, 80
indexing samples into system’s database, 82
loading samples, 85
obtaining minhashes and sketches, 81–82
parsing user command line arguments, 84–85
commenting on samples, 86
searching for similar samples, 86
wiping database, 86
pick_best_question function, 112–113
.png format, 43
pooling layer, 194
Portable Executable (PE) file format, 2–5
dissecting files using pefile, 5–7
entry point, 3
DOS header, 3
PE header, 3
sections, defined, 4
Portable Executable (PE) header, 3, 135–136
position independence, 5
precision
effect of base rate on, 124–125
estimating in deployment environment, 125–126
predict_proba method, 144, 149
PReLU activation function, 179t
program stack, defined, 14
projected_graph function, 54
projections, 38
random forest
random forest–based detectors, 141–146
running detector on new binaries, 144
RandomForestClassifier class, 143, 152
ransomware, 30–31, 31f, 155–158, 156f, 158, 164–168, 165f–166f, 168f, 172–173, 172f–173f
.rdata section (in PE file format), 4
Receiver Operating Characteristic curves. See ROC curves
rectified linear unit (ReLU) activation function, 177f, 178t, 180, 182f, 183–185, 201
recurrent neural networks (RNNs), 196
registry keys, 32
.reloc section (in PE file format), 5
ReLU (rectified linear unit) activation function, 177f, 178t, 180, 182f, 183–185, 201
ResNets (residual networks), 196–197
resource_projection argument, 52, 227
resource obfuscation, 22
reverse engineering, 12
anti-disassembly techniques, 22
dynamic analysis for, 26
methods for, 12
shared code analysis, 60
using pefile and capstone, 20–21
RNNs (recurrent neural networks), 196
ROC (Receiver Operating Characteristic) curves, 123–124, 123f, 126, 147–150, 230–231, 231f
cross-validation, 151–152, 153f
neural networks, 209–210, 210f–211f
.rsrc section (resources) (in PE file format), 4–5
sandbox, 26
Sanders, Hillary, 216
savefig function, 165
scan_file function, 144
scikit-learn (sklearn) machine learning package, 127–128
building basic decision tree–based detectors, 129–134
building random forest–based detectors, 141–146
evaluating detector performance, 146–153
terminology and concepts, 128–129
classifiers, 129
fit, 129
prediction, 129
vectors, 128
seaborn package, 168–174, 168f
creating violin plots, 172–174, 172f–173f
plotting distribution of antivirus detections, 169–172, 169f, 171f
SEARCH mode, 229
section headers (in PE file format), 4–5
.data section, 4
.idata section (imports), 4
.rdata section, 4
.reloc section, 5
.rsrc section (resources), 4–5
.text section, 4
security data scientists, 215–220
expanding knowledge of methods, 219–220
paths to becoming, 216
obsession with results, 219
open-mindedness, 218
skepticism of results, 219
willingness to learn, 216
data feed identification, 218
dealing with stakeholders, 217
deployment, 218
problem identification, 217–218
solution building and evaluation, 218
self-modifying code, 12
set_axis_labels function, 172
shared attribute analysis. See malware network analysis
shared code analysis (similarity analysis), 59–87, 60, 61f
bag of features model, 62–64, 63f
features, defined, 62
dataset for, 223
persistent malware similarity search systems, 79–87
allowing users to search for and comment on samples, 82–84
commenting on samples, 86
implementing database functionality, 80–81
importing packages, 80
indexing samples into system database, 82
loading samples, 85
obtaining minhashes and sketches, 81–82
parsing user command line arguments, 84–85
searching for similar samples, 86
wiping database, 86
scaling similarity comparisons, 77–79
difficulties with, 77
declaring utility functions, 73–74
importing libraries, 73
iterating through pairs, 75
Jaccard index threshold, 73
parsing user’s command line arguments, 74
visualizing graphs, 76
similarity matrices, 66–72, 66f–67f
concept of, 66
dynamic API call–based similarity, 72, 72f
Import Address Table–based similarity, 71, 71f
instruction sequence–based similarity, 67–70, 68f
strings-based similarity, 70–71, 70f
shared image relationship networks, 54–58, 55f, 226–227
extracting graphical assets, 57
parsing initial argument and file-loading code, 55–57
saving networks to disk, 58
shelve module, 80
show function, 152, 163, 165, 168
Sigmoid activation function, 180t, 201
similarity analysis. See shared code analysis
declaring utility functions, 73–74
importing libraries, 73
iterating through pairs, 75
Jaccard index threshold, 73
parsing user’s command line arguments, 74
visualizing graphs, 76
similarity matrices, 66–72, 66f–67f
dynamic API call–based similarity, 72, 72f
Import Address Table–based similarity, 71, 71f
instruction sequence–based similarity, 67–70, 68f
strings-based similarity, 70–71, 70f
sklearn. See scikit-learn machine learning package
skor malware family, 62, 67f–68f, 70f–72f
Softmax activation function, 180t
Sophos, 216
splines parameter, 44
split_regex expression, 203–204
stack, defined, 16
stack management registers, 14–15
dataset for, 222
disassembly and reverse engineering, 12
methods for, 12
using pefile and capstone, 20–21
anti-disassembly techniques, 22
dynamically downloaded data, 22–23
resource obfuscation, 22
Portable Executable file format, 2–5
std function, 160
Step activation function, 179t
steps_per_epoch parameter, 206
strings
defined, 8
feature extraction, 135, 141–142
analyzing printable strings, 8–10
information revealed through, 8
printing all strings in a file to terminal, 8–9
strings-based similarity, 70–71, 70f
sub arithmetic instruction, 15
summary function, 202–203, 202f
supernodes, 46
suspicious_calls parameter, 103–104
suspiciousness scores, 121–122, 121f–122f
Target, iii
target_directory argument, 227–228
target_path argument, 52, 225, 227
.text section (in PE file format), 4
threat scores, 147
.todense method, 142
train_detector function, 143
training_examples variable, 130
tree module, 129
Trojans, 54–55, 55f, 158–161, 168f, 172f–173f, 173
true negatives, defined, 120, 120f
base rates and precision, 124–126
relationship between true and false positive rates, 121–122, 121f–122f
true positive rate, 121
underfit machine-learning models, 98–99, 99f
universal approximation theorem, 181–182, 182f
UPX packer, 29
validation_labels object, 210–211
validation_scores object, 210
vanishing gradient problem, 192–193
vbna malware family, 62, 67f–68f, 70f–72f
vectors, 128
violin plots, 172–174, 172f–173f
virtual size, 6
basic machine learning–based malware detectors, 131–133, 132f
dataset for, 224
malware network analysis
challenges to, 40f
creating with GraphViz, 45f–47f
network analysis
ROC curves, 149, 150f, 152–153, 153f
shared code analysis, 76
using matplotlib, 162–167, 162f
plotting ransomware and worm detection rates, 165–167, 166f
plotting ransomware detection rates, 164–165, 165f
plotting relationship between malware size and detection, 162–163
filtering data using conditions, 161
manipulating DataFrame, 159–161
creating violin plots, 172–174, 172f–173f
plotting distribution of antivirus detections, 169–172, 169f, 171f
webprefix malware family, 62, 67f–68f, 70f–72f
weight attribute, 37
Wells Fargo, iii
Wikipedia, 220
wipe mode, 229
work method, 57
worms, 158–161, 165–167, 166f, 168f, 172, 172f–173f
wrestool tool, 55
downloading, 8
extracting image resources, 7–8
x86 assembly language
arithmetic instructions, 15, 15t
general-purpose registers, 13–14
stack and control flow registers, 14–15
data movement instructions, 15–20, 16t
basic blocks and control flow graphs, 19–20, 19f
control flow instructions, 17–18
dialects of, 13
shared code analysis, 67
xtoober malware family, 62, 67f–68f, 70f–72f
yield statement, 205