Scheme File Definition
How to Define a scheme for a Threat Model Document
Overview
threatware tries to be format agnostic about the structure of the threat model document, which means you can structure your threat model how you want. To support this, threatware needs a file (called a ‘scheme’) that represents the format of your threat model document. This is exactly how the default template works, and you can look at the default ‘scheme’ files for Confluence and Google Docs as examples.
The ‘scheme’ is a parameter that is passed to threatware actions, and is the basis for the convert action to convert a threat model document to a yaml representation (which then becomes the common format that all other actions use for processing).
Unfortunately, there is no standardised language to describe the format of a generic document and assign attributes to parts of it. Standard document formats like HTML or Markdown aren’t expressive enough to capture detailed information that might be nested in a document, and other schema languages are specific to their corresponding format, e.g. XSD for XML. So threatware uses a custom language to describe threat model documents, which is by no means an ideal solution.
The goal of this section is to give you a high-level overview of the format of ‘scheme’ files and provide an example of how to extract a table from a Google Doc. Whilst it isn’t expected that most people will want to create ‘scheme’ files from scratch, the idea is to explain the basics, which should allow modifications of existing ‘scheme’ files.
Scheme format (high-level)
The job of the scheme format is to map the contents of a document to a yaml representation of that document. The yaml representation does not need to include all the information in the document, just the information you want to store or validate (and some information, like diagrams, cannot currently be captured).
The ‘scheme’ file itself is a yaml file, and at the highest level a scheme document looks like:

    ---
    scheme:
      map:
        get-data:
        map-data:
        output-data:

It’s helpful to think of threatware as having an internal document conversion process that consumes this configuration, starting with get-data:, then taking the result and passing it to map-data:, and then taking the result and passing it to output-data:.
Where:

    scheme:
      map:
        get-data:

Configures where in the document to get data from. This data will be passed as input to the processing of the map-data: configuration.

    map-data:

Configures how to map the data from get-data: to key/value pairs.

    output-data:

Configures any data processing to perform on the key/value pairs output by map-data:.

Notably, these 3 configurations can be used recursively under a child key of the map-data: configuration, which allows us to create arbitrarily nested yaml (explained more later) to represent the data in the document.
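
The get-data: / map-data: / output-data: flow and its recursion can be pictured with a minimal Python sketch. This is not threatware's implementation: the `process` function, the use of Python callables in place of yaml configuration, and the sample document are all assumptions made purely for illustration.

```python
def process(config, data):
    """Run one level of a (hypothetical) get -> map -> output pipeline."""
    # get-data: narrow the input; when omitted, all data is passed through.
    get = config.get("get-data", lambda d: d)
    selected = get(data)

    # map-data: each entry names a key and carries a nested config for its
    # value, which is processed recursively against the selected data.
    entries = config.get("map-data")
    if entries is None:
        mapped = selected
    else:
        mapped = {e["key"]: process(e["value"], selected) for e in entries}

    # output-data: optional processing of whatever map-data produced.
    output = config.get("output-data", lambda d: d)
    return output(mapped)


# Illustrative configuration: two keys, each with its own nested config.
config = {
    "map-data": [
        {"key": "title", "value": {"get-data": lambda d: d["heading"]}},
        {"key": "summary", "value": {
            "get-data": lambda d: d["body"],
            "output-data": str.strip,
        }},
    ],
}

document = {"heading": "Threat Model", "body": "  A series of data flows.  "}
result = process(config, document)
print(result)
```

The nested configuration under each key receives the data selected at the level above it, which is the property that allows arbitrarily nested output.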
Now let’s add the next level of configuration keys for each of the above:

    ---
    scheme:
      map:
        get-data:
          query:
        map-data:
          - key:
            value:
          - key:
            value:
        output-data:

Where:

    get-data:
      query:

Will contain the configuration of how to select data from the document.

    map-data:
      - key:
        value:
      - key:
        value:

A list of key: / value: pairs.

It’ll be easiest to understand the next level of configuration by going through an example of extracting table data from a document.
Extracting a table
Since a lot of information in documents is often in a table, let’s imagine we have a document like the following:
Document
Threat Model
Description
This is a threat model for a series of data flows.
Threats Table
Data Flow | Threat Description | Control | Risk Rating |
---|---|---|---|
DF1 | No use of SSL | Use TLS | Low |
DF2 | No encryption | Use encryption | Low |
DF3 | Unauthorised access | IP Allowlist | Medium |
Conclusion
The overall risk is Low.
What we want is to extract the information in the table into a yaml document that will look like:

    threats:
      - data-flow: DF1
        threat-description: No use of SSL
        control: Use TLS
        risk-rating: Low
      - data-flow: DF2
        threat-description: No encryption
        control: Use encryption
        risk-rating: Low
      - data-flow: DF3
        threat-description: Unauthorised access
        control: IP Allowlist
        risk-rating: Medium

The high level ‘scheme’ file will start out like this:

    ---
    scheme:
      version: 1.0
      document-storage: googledoc
      map:
        map-data:
        output-data:
          type: dict

Let’s explain the new lines, line-by-line:

    scheme:
      version: 1.0

This is just the version number of the scheme file format and can only be 1.0.

    document-storage: googledoc

This tells threatware the document is a Google Doc.

    output-data:
      type: dict

This tells threatware the content will be output as a dict.

Note, we removed the get-data: configuration, as it isn’t required under the map: configuration. The job of get-data: is to query for a subset of data, but since we want to pass all data to the rest of the configuration we can remove it.
Now let’s start specifying how to actually get to the data we want. Remember, at this stage the entire HTML representation of the Google Doc will be passed to the map-data: configuration.

    ---
    scheme:
      version: 1.0
      document-storage: googledoc
      map:
        map-data:
          - key:
              name: threats
          - value:
              get-data:
                - query:
                    type: "html-table"
                    xpath: "//html/body/h2[text()='Threats Table']/following::table[1]"
        output-data:
          type: dict

Let’s explain the new lines, line-by-line:

    map-data:
      - key:
          name: threats

This is the name of the key in the yaml output.

    - value:
        get-data:

The …

    - query:
        type: "html-table"

This is the type of data we want to extract from the document. See Query types for a list of supported types.

    xpath: "//html/body/h2[text()='Threats Table']/following::table[1]"

This is the HTML XPath query to the HTML table to get data from. XPath was chosen as the input is HTML and it is the most compact way to locate an HTML element. There is a learning curve to using it, but the examples provided should be fairly easy to alter for your own documents.
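
To make the selection concrete: the XPath above picks the first table that follows, in document order, the h2 heading whose text is 'Threats Table'. The sketch below approximates that selection with Python's standard-library `html.parser` rather than evaluating XPath, so it is only an illustration of what gets selected, not how threatware does it. The `TableAfterHeading` class and the sample HTML are hypothetical, and unlike the XPath `text()` comparison the heading text is stripped of surrounding whitespace before comparing.

```python
from html.parser import HTMLParser


class TableAfterHeading(HTMLParser):
    """Collect the rows of the first <table> that follows a given <h2>."""

    def __init__(self, heading_text):
        super().__init__()
        self.heading_text = heading_text
        self.in_heading = False   # currently inside an <h2>
        self.heading_buf = []
        self.armed = False        # target heading seen, waiting for a table
        self.in_table = False
        self.in_cell = False
        self.done = False         # the table has been fully read
        self.rows = []

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if tag == "h2":
            self.in_heading, self.heading_buf = True, []
        elif tag == "table" and self.armed and not self.in_table:
            self.in_table = True
        elif self.in_table and tag == "tr":
            self.rows.append([])
        elif self.in_table and tag in ("td", "th"):
            if not self.rows:
                self.rows.append([])
            self.in_cell = True
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag == "h2" and self.in_heading:
            self.in_heading = False
            if "".join(self.heading_buf).strip() == self.heading_text:
                self.armed = True
        elif tag == "table" and self.in_table:
            self.in_table, self.done = False, True
        elif tag in ("td", "th") and self.in_cell:
            self.in_cell = False
            self.rows[-1][-1] = self.rows[-1][-1].strip()

    def handle_data(self, data):
        if self.in_heading:
            self.heading_buf.append(data)
        elif self.in_cell:
            self.rows[-1][-1] += data


html_doc = """
<html><body>
<h1>Threat Model</h1>
<table><tr><td>unrelated</td></tr></table>
<h2>Threats Table</h2>
<table>
<tr><th>Data Flow</th><th>Threat Description</th><th>Control</th><th>Risk Rating</th></tr>
<tr><td>DF1</td><td>No use of SSL</td><td>Use TLS</td><td>Low</td></tr>
</table>
<h2>Conclusion</h2>
</body></html>
"""

parser = TableAfterHeading("Threats Table")
parser.feed(html_doc)
rows = parser.rows
print(rows)
```

Note that the header row is part of the result, and the earlier unrelated table is skipped because it appears before the heading.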
The above get-data: configuration extracts all the data in the table (including the header row), but we still need to assign each row in the table to the list of child keys we want. To do this we need to add a map-data: configuration to take the data from get-data: and map it to the keys.

    ---
    scheme:
      version: 1.0
      document-storage: googledoc
      map:
        map-data:
          - key:
              name: threats
          - value:
              get-data:
                - query:
                    type: "html-table"
                    xpath: "//html/body/h2[text()='Threats Table']/following::table[1]"
              map-data:
                - key:
                    name: "data-flow"
                  value:
                    get-data:
                      query:
                        type: html-table-row
                        column-num: 0
                - key:
                    name: "threat-description"
                  value:
                    get-data:
                      query:
                        type: html-table-row
                        column-num: 1
                - key:
                    name: "control"
                  value:
                    get-data:
                      query:
                        type: html-table-row
                        column-num: 2
                - key:
                    name: "risk-rating"
                  value:
                    get-data:
                      query:
                        type: html-table-row
                        column-num: 3
              output-data:
                post-processor:
                  remove-header-row:
        output-data:
          type: dict

Now you can see the recursive nature of the get-data: / map-data: / output-data: configurations: as we recurse down in the configuration, the data queried and mapped at a higher configuration level is passed to the nested configurations.
It’s worth repeating that the get-data: configuration with type: "html-table" is designed to pass 1 row at a time to the subsequent map-data: configuration, with output-data: (which by default has type: list) gathering the output of map-data: into a yaml list under the threats key.
Let’s explain the new lines, line-by-line:

    map-data:
      - key:
          name: "data-flow"

Maps some content (defined in value:) to the data-flow key.

    value:
      get-data:
        query:
          type: html-table-row

Defines the type of data we are processing. We use html-table-row because the input is a single row passed from the html-table query.

    column-num: 0

Gets the contents of column 0 in this table row.

The same logic applies to the configuration of each of the 4 list items defined under map-data:. It’s also important to note that threatware specifically does not require you to use the name of the column to get content. This was a design choice so that column names could be changed in documents without breaking the document content conversion (but also note that the name of the header above the table is used in the scheme document, so changing that would break document conversion).
Lastly there was:

    output-data:
      post-processor:
        remove-header-row:

This configuration strips the header row of the table from the output, as that is not content we want to capture. Note, this isn’t done by default as HTML tables don’t always include header rows.
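
The row mapping and header removal described above can be sketched in a few lines of Python. This is a simplified model, not threatware's code: the `COLUMN_KEYS` table, the `map_row` and `remove_header_row` helpers, and the hard-coded table rows are assumptions that mirror the example scheme and document.

```python
# Column index -> output key, mirroring the column-num values in the scheme.
COLUMN_KEYS = {
    0: "data-flow",
    1: "threat-description",
    2: "control",
    3: "risk-rating",
}


def map_row(row):
    """Model of map-data: pick each value by column number, not column name."""
    return {key: row[column] for column, key in COLUMN_KEYS.items()}


def remove_header_row(mapped_rows):
    """Model of the remove-header-row post-processor: drop the first row."""
    return mapped_rows[1:]


# What the html-table query would hand over: every row, header included.
table = [
    ["Data Flow", "Threat Description", "Control", "Risk Rating"],
    ["DF1", "No use of SSL", "Use TLS", "Low"],
    ["DF2", "No encryption", "Use encryption", "Low"],
    ["DF3", "Unauthorised access", "IP Allowlist", "Medium"],
]

# Rows are mapped one at a time, gathered into a list, then post-processed.
mapped = [map_row(row) for row in table]
output = {"threats": remove_header_row(mapped)}
print(output["threats"][0])
```

Because values are selected by position, renaming a column header in the document leaves the output keys unchanged.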
At this stage we have a scheme file sufficiently defined to convert a Google Doc containing a single table into a yaml file containing the contents of the table. However, we don’t have enough information to validate that the contents of the table are correct. To do that we need to add metadata to the data that we want to validate. In the scheme file format we do this by adding a tags: configuration.
Let’s imagine we want to validate the data-flow and risk-rating data to make sure it has the correct format. Also, we want the entry for control not to be mandatory (mandatory is the default).
We’ll also name our keys, which isn’t relevant in this example, but is essential if performing validation between tables. The scheme format decouples the key: / name: configuration from the name used to reference content for validation, as this lets us define validation configuration in a language-agnostic way (thus supporting internationalization).

    ---
    scheme:
      version: 1.0
      document-storage: googledoc
      map:
        map-data:
          - key:
              name: threats
              section: Threats Table
              tags:
                - threats-table-data
          - value:
              get-data:
                - query:
                    type: "html-table"
                    xpath: "//html/body/h2[text()='Threats Table']/following::table[1]"
              map-data:
                - key:
                    name: "data-flow"
                    tags:
                      - data-flow
                      - validate-as-data-flow
                  value:
                    get-data:
                      query:
                        type: html-table-row
                        column-num: 0
                - key:
                    name: "threat-description"
                    tags:
                      - threat-description
                  value:
                    get-data:
                      query:
                        type: html-table-row
                        column-num: 1
                - key:
                    name: "control"
                    tags:
                      - control
                      - not-mandatory
                  value:
                    get-data:
                      query:
                        type: html-table-row
                        column-num: 2
                - key:
                    name: "risk-rating"
                    tags:
                      - risk-rating
                      - validate-as-risk-rating
                  value:
                    get-data:
                      query:
                        type: html-table-row
                        column-num: 3
              output-data:
                post-processor:
                  remove-header-row:
        output-data:
          type: dict

If you look through the scheme format you can see the new tags: configuration. Remember, tags don’t cause anything to happen during conversion; they are just used by validation logic (so they are only relevant when the threatware validate action is invoked).
Some tags to note:

    - key:
        name: threats
        section: Threats Table

Is used to name the table location, and is used when a validation issue is discovered with an element of the table (so it is easier to find the error in the document).

    - validate-as-data-flow
    - validate-as-risk-rating

These aren’t defined, but their definition could be added to validators_config.yaml, and would mean the validation logic, upon consuming the definition, would look for all key values in the output …
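
As a rough picture of how tags might drive validation, the following Python sketch checks tagged values against patterns and treats empty values as errors unless they carry the not-mandatory tag. Everything here is hypothetical: the in-memory row structure, the `VALIDATORS` table, the regular expressions, and the message format are assumptions, and the real definitions would live in validators_config.yaml, whose format is not covered in this section.

```python
import re

# Assumed validator definitions keyed by tag (illustrative patterns only).
VALIDATORS = {
    "validate-as-data-flow": re.compile(r"DF\d+"),
    "validate-as-risk-rating": re.compile(r"Low|Medium|High"),
}

# Hypothetical in-memory form of two converted rows: key -> (value, tags).
rows = [
    {
        "data-flow": ("DF1", ["data-flow", "validate-as-data-flow"]),
        "control": ("", ["control", "not-mandatory"]),
        "risk-rating": ("Low", ["risk-rating", "validate-as-risk-rating"]),
    },
    {
        "data-flow": ("Flow 2", ["data-flow", "validate-as-data-flow"]),
        "control": ("Use encryption", ["control", "not-mandatory"]),
        "risk-rating": ("Severe", ["risk-rating", "validate-as-risk-rating"]),
    },
]


def validate(rows, section):
    """Return a list of issues, each prefixed with the section name."""
    issues = []
    for number, row in enumerate(rows, start=1):
        for key, (value, tags) in row.items():
            if not value:
                # Values are mandatory unless tagged otherwise.
                if "not-mandatory" not in tags:
                    issues.append(f"{section}, row {number}: {key} is mandatory")
                continue
            for tag in tags:
                pattern = VALIDATORS.get(tag)
                if pattern is not None and not pattern.fullmatch(value):
                    issues.append(
                        f"{section}, row {number}: {key} value {value!r} fails {tag}"
                    )
    return issues


issues = validate(rows, "Threats Table")
for issue in issues:
    print(issue)
```

The section name plays the role described for section: above, pointing the reader to the place in the document where the problem was found.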
Note
In practice, writing the xpath selectors to get content from HTML can be quite challenging, especially because the output HTML from Google Docs or Confluence contains a lot of additional elements and attributes. To make this easier, the scheme file format supports a pre-processor that is applied before the content gets passed to the root map: configuration. That pre-processor can be used to strip formatting HTML content that otherwise makes extracting data from the HTML challenging.
See the example scheme files confluence-scheme-1.0.yaml and googledoc-scheme-1.0.yaml for appropriate pre-processing values for the corresponding document formats.
Configuring a new scheme file
After you create a new scheme file you need to make it available to use. To do that you can follow these steps:
1. Save your file in the schemes/ directory where your configuration files are stored.
2. Update schemes/schemes.yaml to include a new key that is the name of your scheme, with a value that is the file location (see the existing entries for examples).
3. Now you can use your new scheme name wherever the scheme parameter is used.
Testing a new scheme file
To test the new scheme file:
- Use the CLI version of threatware, for speed of debugging.
- Enable DEBUG level logs.
- Use the threatware convert action.
- Set the convert parameter -meta none, as this will avoid outputting tags, which makes the output easier to read. When you add tags to the scheme and want to check the tags are being assigned correctly, you can remove the meta parameter (as tags are shown by default, i.e. -meta tags).
Reference
Query types
The get-data: / query: configuration supports the following type: values:
html-table
Category: HTML
Use case: Extract an HTML table from HTML
Example:

    get-data:
      - query:
          type: "html-table"
          xpath: "//html/body/h1[text()='Details']/following::table[1]"
          remove-header-row:
          remove-rows-if-empty:

Description: This returns the content of an HTML table as a list of rows. It will iterate over each row, passing it to the associated map-data: configuration.

html-table-row
Category: HTML
Use case: Get the value from a column given a row
Example:

    get-data:
      query:
        type: html-table-row
        column-num: 1

Description: This configuration expects to be passed a row from an html-table query.

html-table-transpose
Category: HTML
Use case: Get an HTML table as a series of columns
Example:

    get-data:
      - query:
          type: "html-table"
          xpath: "//html/body/h1[text()='Details']/following::table[1]"
      - query:
          type: "html-table-transpose"

Description: This is usually chained with an html-table query.

text-split
Category: Text
Use case: Returns individual data in a section of text
Example:

    get-data:
      - query:
          type: html-table-row
          column-num: 2
      - query:
          type: text-split
          split-by:
            char-separators:
              - ","
              - "\n"

Description: Given a value from a previous query as input, the …

text-match
Category: Text
Use case: Returns lines that match
Example:

    get-data:
      - query:
          type: html-table-row
          column-num: 2
      - query:
          type: text-match
          line-regex: "(?i)confidentiality|integrity|availability"

Description: Given a value from a previous query as input, the …

text-replace
Category: Text
Use case: Returns a replacement value on match
Example:

    get-data:
      - query:
          type: html-table-row
          column-num: 2
      - query:
          type: text-replace
          match: "SSL"
          replacement: "TLS"

Description: Given a value from a previous query as input, the …

value-extract
Category: Value
Use case: Returns a capture group from a regex
Example:

    get-data:
      - query:
          type: value-extract
          regex: "^(\\w+):(.*)"
          group: 1

Description: Given a value from a previous query as input, the …

value-match
Category: Value
Use case: Returns lines that match
Example:

    get-data:
      - query:
          type: html-table-row
          column-num: 2
      - query:
          type: value-match
          line-regex: "(?i)confidentiality|integrity|availability"

Description: Given a value from a previous query as input, the …

value-replace
Category: Value
Use case: Returns a replacement value on match
Example:

    get-data:
      - query:
          type: html-table-row
          column-num: 2
      - query:
          type: value-replace
          match: "SSL"
          replacement: "TLS"

Description: Given a value from a previous query as input, the …

value-urldecode
Category: Value
Use case: Returns a URL decoded version of the value
Example:

    get-data:
      - query:
          type: html-table-row
          column-num: 2
      - query:
          type: value-urldecode

Description: Given a value from a previous query as input, the …
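
To give a feel for what some of the text and value queries compute, and for how a list of queries is chained, here is a minimal Python sketch built on the standard `re` and `urllib.parse` modules. The function names and exact behaviours (for example, dropping empty parts after a split, or matching with a regex search per line) are assumptions for illustration and may differ from threatware's implementation. The replace queries are left out because this section does not state whether they replace the matched text or the whole value.

```python
import re
from urllib.parse import unquote


def text_split(value, char_separators):
    """Model of text-split: split on any separator, dropping empty parts."""
    pattern = "|".join(re.escape(sep) for sep in char_separators)
    return [part.strip() for part in re.split(pattern, value) if part.strip()]


def text_match(value, line_regex):
    """Model of text-match: keep only the lines the regex matches."""
    return [line for line in value.splitlines() if re.search(line_regex, line)]


def value_extract(value, regex, group):
    """Model of value-extract: return one capture group, or None."""
    found = re.search(regex, value)
    return found.group(group) if found else None


def value_urldecode(value):
    """Model of value-urldecode: decode percent-encoded characters."""
    return unquote(value)


def chain(value, *queries):
    """Model of a get-data: list, where each query consumes the previous output."""
    for query in queries:
        value = query(value)
    return value


parts = text_split("Confidentiality, Integrity\nAvailability", [",", "\n"])
lines = text_match(
    "Loss of integrity\nSlow UI\nAVAILABILITY impact",
    r"(?i)confidentiality|integrity|availability",
)
owner = chain(
    "owner:alice%40example.com",
    lambda v: value_extract(v, r"^(\w+):(.*)", 2),
    value_urldecode,
)
print(parts, lines, owner)
```

In the scheme examples above the first query in a chain is typically html-table-row, which supplies the cell text that the later queries refine.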
Post processing options
The output-data: / post-processor: configuration supports the following values:
remove-empty-rows
Use case: Removes empty rows from a table
Example:

    output-data:
      post-processor:
        remove-empty-rows:
          ignore-keys:
            - category

Description: Given the output of …

remove-header-row
Use case: Removes the 1st row from a table
Example:

    output-data:
      post-processor:
        remove-header-row:

Description: Given the output of …

inherit-row-above-if-empty
Use case: Allows a row to inherit the value of the row above in a given column
Example:

    output-data:
      post-processor:
        inherit-row-above-if-empty:
          key: category

Description: Given the output of …

value-replace
Use case: Returns a replacement value on match
Example:

    output-data:
      post-processor:
        value-replace:
          - match: "C"
            replacement: "Confidentiality"
          - match: "I"
            replacement: "Integrity"
          - match: "A"
            replacement: "Availability"

Description: Given some input (a string), looks for any of the list of …
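
Two of the row-oriented post-processors can be modelled in a few lines of Python to show the intent behind their options. This is a sketch under assumptions, not threatware's code: the function names, the treatment of ignore-keys as "keys that don't count when deciding whether a row is empty", and the order in which the two steps are applied are all illustrative choices.

```python
def inherit_row_above_if_empty(rows, key):
    """Fill an empty value for `key` with the value from the row above."""
    result, previous = [], None
    for row in rows:
        row = dict(row)  # copy so the input rows are left untouched
        if not row.get(key) and previous is not None:
            row[key] = previous
        previous = row.get(key)
        result.append(row)
    return result


def remove_empty_rows(rows, ignore_keys=()):
    """Drop rows whose values are all empty, not counting the ignored keys."""
    def is_empty(row):
        return all(not value for k, value in row.items() if k not in ignore_keys)
    return [row for row in rows if not is_empty(row)]


rows = [
    {"category": "Network", "threat": "No use of SSL"},
    {"category": "", "threat": "No encryption"},
    {"category": "Access", "threat": ""},
    {"category": "", "threat": "Unauthorised access"},
]

filled = inherit_row_above_if_empty(rows, "category")
cleaned = remove_empty_rows(filled, ignore_keys=["category"])
for row in cleaned:
    print(row)
```

Ignoring the category key matters here: the third row has a category but no other content, so it only counts as empty once that key is excluded from the check.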