Understanding Data Filtering

This section describes how the device implements in-depth identification of traffic content and blocks or alerts on traffic containing specified keywords.

Data Filtering

Data filtering falls into two types: file data filtering and application data filtering.

File data filtering filters the uploaded and downloaded files by keyword. You can specify the protocols for file transfer or the types of files to be filtered.

Application data filtering filters application content by keyword. The content that the device filters depends on the application, as listed in Table 1.

**Table 1** Data filtering
Application		Filter By
protocol	HTTP	Upload direction: content posted on microblogs; content posted on forums; search keywords; submitted information, such as registration information; names of files to be uploaded. Download direction: content of web pages; names of files to be downloaded using HTTP.
	FTP	Name and content of the file to be uploaded or downloaded
	SMTP	Title, body, and attachment name of the sent mail
	POP3	Title, body, and attachment name of the received mail
	NFS	Uploaded and downloaded file
	SMB	Uploaded and downloaded file
	IMAP	Title, body, and attachment name of the received mail
	RTMPT	Name of the file transmitted using RTMPT
	FLASH	Name of the Flash file
File Sharing		Names of shared files

Keyword

A keyword is the content that the device looks for during data filtering. The device performs the specified action on files or application content containing the keyword. Keywords typically represent confidential or illegal information.

Keywords are either predefined or user-defined.

- Predefined keywords include bank card numbers, credit card numbers, social security numbers, ID card numbers, mobile phone numbers, and confidentiality levels (confidential, secret, and top secret).
- User-defined keywords can be text strings or regular expressions.

Content matched by a text or regular expression keyword must contain at least three bytes. Each ASCII character is one byte, and each Chinese character is two bytes.

For example, a keyword can match abc, but cannot match a, ab, or b.
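
The byte-counting rule above can be sketched as follows. This is a minimal illustration, not the device's implementation: the function names are hypothetical, and every non-ASCII character is assumed to be a two-byte Chinese character.

```python
def keyword_byte_length(keyword: str) -> int:
    """Count 1 byte per ASCII character and 2 bytes per other character.

    Assumption: any non-ASCII character is a two-byte Chinese character.
    """
    return sum(1 if ord(ch) < 128 else 2 for ch in keyword)


def meets_minimum_length(keyword: str, minimum: int = 3) -> bool:
    """Return True if the keyword reaches the three-byte minimum."""
    return keyword_byte_length(keyword) >= minimum


for candidate in ["abc", "ab", "a", "机密", "中"]:
    print(candidate, keyword_byte_length(candidate), meets_minimum_length(candidate))
```

Under this counting, a single Chinese character (two bytes) is too short, while two Chinese characters (four bytes) are long enough.
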
- For a text keyword, you only need to enter the exact keywords to be filtered. Text keywords are easy to configure and are used for an exact match.
- Regular expression keywords provide fuzzy matching capability. For example, "." in "abc.de" can represent any single character. Therefore, "abc.de" can match "abcxde", "abcyde", or "abc8de".
  
  Regular expression keywords match flexibly and efficiently, but they must follow the rules of regular expressions. Table 2 lists the rules of regular expressions.
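
The difference between the two keyword types can be illustrated with Python's standard `re` module. This is only a sketch: the device's regular expression engine is not Python's, and the helper names are hypothetical.

```python
import re


def contains_text_keyword(content: str, keyword: str) -> bool:
    """Exact match: the keyword must appear literally in the content."""
    return keyword in content


def contains_regex_keyword(content: str, pattern: str) -> bool:
    """Fuzzy match: the pattern is interpreted as a regular expression."""
    return re.search(pattern, content) is not None


for sample in ["abcxde", "abc8de", "abc.de", "abcde"]:
    print(
        sample,
        contains_text_keyword(sample, "abc.de"),
        contains_regex_keyword(sample, "abc.de"),
    )
```

As a text keyword, `abc.de` is found only in the literal string `abc.de`. As a regular expression, it is also found in `abcxde` and `abc8de`, but not in `abcde`, because `.` must consume exactly one character.
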

**Table 2** Rules of regular expressions
Character	Description
\	Escape character. Place \ before a special character to match it literally, for example, \., \(, and \).
.	Matches any single ASCII character or Chinese character. For example, abc.de can match abcade, abcyde, and abc8de. A regular expression, including each alternative separated by |, cannot start or end with a period (.). For example, .abc|def, abc.|def, abc|.def, abc|def., and abc|def.|ghi are invalid inputs.
( )	Indicates the beginning or end of a subexpression. For example, (abc)+ can match abc and abcabc.
?	Matches the previous character or expression zero or one time. For example, abcd? can match abc and abcd. Note that the regular expression cannot be set to abc?. If the match count is 0, the matched keyword would be ab, but the keyword that a regular expression can match must contain at least three bytes. Therefore, there must be at least four characters in front of ?.
*	Matches the previous character or expression zero or more times. For example, abcd* can match abc, abcd, and abcddd. Note that the regular expression cannot be set to abc*. If the match count is 0, the matched keyword would be ab, but the keyword that a regular expression can match must contain at least three bytes. Therefore, there must be at least four characters in front of *.
+	Repeats the previous character or expression one or more times. For example, abc+ can match abc and abcc, but not ab.
|	Matches the expression either before or after the operator. For example, abc|defg can match abc or defg, and (a|b)cde can match acde or bcde.
-	Creates an expression range. For example, [a-z] can match any single character from a to z, including a and z.
[ ]	Matches any single character that is contained within the brackets. For example, abc[def] can match abcd, abce, or abcf. [] must enclose at least one character. [] cannot enclose ASCII characters and Chinese characters at the same time. [] can enclose the escape character (\). [] can enclose a hyphen (-), but the characters must be from A to Z, a to z, or 0 to 9. For example, [b-d], [A-Q], and [2-9] are valid inputs, but [b-A], [k-a], and [k-] are invalid inputs.
{n}	Matches the previous character n times. n is a non-negative integer and less than 10. For example, abc{2} cannot match abc in oabco, but can match abcc in oabcco.
{n,m}	Matches the previous character or expression at least n times and at most m times. Both n and m are non-negative integers smaller than or equal to 10, and n is smaller than m. For example, abcd{0,3} can match abc, abcd{1,3} can match abcdd, and (abc){1,5} can match abcabcabc.
\d	Matches a digit character. It is equivalent to [0-9]. For example, abc\d can match abc0 and abc9.
\w	Matches any digit, letter, or underscore. For example, abc\w can match abc2, abcd, abcA, and abc_.
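
Several of the examples in Table 2 can be checked with Python's standard `re` module, whose syntax for these operators is the same. This is only a sketch: the device-specific restrictions (the three-byte minimum, the limits on n and m, the rules for periods and brackets) are not enforced by Python, and the `EXAMPLES` list and `found` helper are hypothetical names.

```python
import re

# (pattern, content, expected result) triples taken from the examples in Table 2.
EXAMPLES = [
    (r"abc.de", "abc8de", True),
    (r"(abc)+", "abcabc", True),
    (r"abcd?", "abc", True),
    (r"abcd*", "abcddd", True),
    (r"abc+", "abcc", True),
    (r"abc+", "ab", False),
    (r"abc|defg", "defg", True),
    (r"(a|b)cde", "bcde", True),
    (r"abc[def]", "abce", True),
    (r"abc{2}", "oabco", False),
    (r"abc{2}", "oabcco", True),
    (r"abcd{1,3}", "abcdd", True),
    (r"(abc){1,5}", "abcabcabc", True),
    (r"abc\d", "abc9", True),
    (r"abc\w", "abc_", True),
]


def found(pattern: str, content: str) -> bool:
    """Return True if the pattern occurs anywhere in the content.

    re.ASCII limits \\d and \\w to ASCII digits, letters, and underscore.
    """
    return re.search(pattern, content, flags=re.ASCII) is not None


for pattern, content, expected in EXAMPLES:
    print(f"{pattern!r} in {content!r}: {found(pattern, content)} (table says {expected})")
```
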

Action

If a keyword is identified in data filtering, the device performs the action listed in Table 3.

**Table 3** Actions performed by the device
Action	Description
Alert	The device generates logs but does not block the content.
Block	The device blocks the content and generates logs. For users, the web pages cannot be displayed, uploading or downloading files fails, and sending or receiving mails fails.
By Weight	Each keyword has a weight. The device multiplies the weight of each identified keyword by the number of times it is matched and adds up the results. If the sum of weights is greater than or equal to the alert threshold and less than the block threshold, the device generates a log but does not block the traffic. If the sum of weights is greater than or equal to the block threshold, the device blocks the traffic. For example, two keywords are defined on the device. The weight of keyword a is 1, and that of keyword b is 2. The alert threshold for data filtering is 1, and the block threshold is 5. If keyword a appears once on the web page browsed by a user, the sum of weights is 1, which is equal to the alert threshold. The device generates a log, but the user can continue browsing the web page. If keyword a appears three times and keyword b appears twice on the web page, the sum of weights is 7 (3 x 1 + 2 x 2 = 7), which is greater than the block threshold of 5. The device blocks the web page and generates a log, and the web page cannot be displayed for the user.
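
The By Weight calculation can be sketched as follows. The function name and the `permit` result for a sum below the alert threshold are assumptions made for illustration; the weights and thresholds are the ones from the example in Table 3.

```python
def weighted_action(match_counts, weights, alert_threshold, block_threshold):
    """Return (sum of weights, action) for the keywords found in one piece of content.

    match_counts maps each identified keyword to its number of matches.
    Assumption: a sum below the alert threshold results in "permit".
    """
    total = sum(weights[keyword] * count for keyword, count in match_counts.items())
    if total >= block_threshold:
        return total, "block"
    if total >= alert_threshold:
        return total, "alert"
    return total, "permit"


weights = {"a": 1, "b": 2}

# Keyword a appears once: sum 1 reaches the alert threshold only.
print(weighted_action({"a": 1}, weights, alert_threshold=1, block_threshold=5))

# Keyword a appears three times and keyword b twice: 3 x 1 + 2 x 2 = 7.
print(weighted_action({"a": 3, "b": 2}, weights, alert_threshold=1, block_threshold=5))
```

The block threshold is checked first so that a sum meeting both thresholds is blocked rather than only logged.
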

Data Filtering Process

If traffic passing through the device matches a security policy rule whose action is permit, and that security policy references a data filtering profile, the device applies data filtering to the traffic.

The data filtering process is as follows:

1. The device detects and identifies the traffic content.

   For an application, the identified content includes the application type and transmission direction. For a file, the identified content includes the protocol used for transmitting the file, the file type, and transmission direction.
2. The device compares the traffic features with the conditions in the data filtering rule. If all conditions are matched, the traffic matches the data filtering rule. Otherwise, the next rule is compared. If no data filtering rule is matched, the device permits the traffic.
3. If the traffic matches a data filtering rule, the device checks whether any keyword defined in the data filtering rule exists in the traffic content. If a keyword is identified, the device performs the specified action. If no keyword is identified, the device permits the traffic.
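
The process above can be sketched in a few lines. This is an illustrative model only: the `Rule` class, its field names, and the feature keys are hypothetical, and keywords are treated as plain text for simplicity.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    name: str
    conditions: dict  # for example {"application": "HTTP", "direction": "upload"}
    keywords: list    # text keywords only in this sketch
    action: str       # "alert" or "block"


def filter_traffic(features: dict, content: str, rules: list) -> str:
    """Return the action for one piece of identified traffic."""
    for rule in rules:
        # A rule is matched only if all of its conditions are met.
        if all(features.get(key) == value for key, value in rule.conditions.items()):
            # Only the first matching rule is checked for keywords.
            if any(keyword in content for keyword in rule.keywords):
                return rule.action
            return "permit"
    # No data filtering rule is matched.
    return "permit"


rules = [
    Rule("http-upload", {"application": "HTTP", "direction": "upload"}, ["top secret"], "block"),
    Rule("smtp-mail", {"application": "SMTP"}, ["confidential"], "alert"),
]

print(filter_traffic({"application": "HTTP", "direction": "upload"}, "this is top secret", rules))
print(filter_traffic({"application": "SMTP", "direction": "upload"}, "hello", rules))
print(filter_traffic({"application": "FTP", "direction": "download"}, "top secret", rules))
```

The third call is permitted even though the content contains a keyword, because the traffic matches no data filtering rule and its content is therefore never checked.
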