Data Profiling & Quality


1. Introduction

Data profiling is the process of examining data from some system and producing metadata that describes the data. Data profiling can be viewed as a fact-finding step that informs (and refines) your data quality rules.

Data quality rules are properties of data that you expect to hold true, while data profiling results are the actual properties of the data that you observe.

2. Column Profiling

Column profiling involves the analysis of data within a single column.

2.1. Cardinality

Cardinality analyzes the number of distinct values that appear within a column. High cardinality is often associated with columns that contain unique identifiers, such as customer IDs or email addresses, as well as numerical values. Low cardinality is often associated with columns that contain categorical data, such as gender or status codes, where the number of possible values is limited.

Cardinality of one is a special case where all values in the column are the same. This is often seen in columns that contain a constant value, such as a status code that is always set to "active". Cardinality of zero (assuming NULLs are not counted) indicates that the column is empty, meaning there are no values present in the column.

There can be situations where a numerical column has a low cardinality, often due to the values being limited to a small set of integer values.
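
As a minimal sketch (assuming the data sits in a pandas DataFrame; the column names below are hypothetical), per-column cardinality can be computed with `nunique`:

```python
import pandas as pd

# Hypothetical example data; in practice the frame would be loaded from the source system.
df = pd.DataFrame({
    "customer_id": [101, 102, 103, 104],   # high cardinality: likely an identifier
    "status":      ["active"] * 4,         # cardinality of one: a constant column
    "score":       [1, 2, 2, 3],           # low-cardinality numeric column
})

# Number of distinct non-NULL values per column.
print(df.nunique(dropna=True))
```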

2.2. Frequencies

Frequencies give an overview of the distribution of values within a column. They allow you to view the most and least frequently occurring values and value lengths.

Frequencies can be used to identify special values, such as hidden NULL equivalents (e.g. values "n/a" or special numeric values that indicate absence of information). Frequencies can also be used to identify outliers, which are values that occur infrequently compared to the rest of the data.
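
A rough sketch of a frequency scan with pandas (the column and the placeholder list are assumptions for illustration):

```python
import pandas as pd

status = pd.Series(["active", "active", "inactive", "n/a", None])

# Most and least frequent values; dropna=False keeps literal NULLs visible.
print(status.value_counts(dropna=False))

# Frequency of value lengths, useful for spotting anomalous formats.
print(status.dropna().str.len().value_counts())

# Candidate hidden NULL equivalents: placeholder-looking strings that occur in the data.
placeholders = {"n/a", "na", "none", "null", "unknown", "-", ""}
print(status.dropna().str.strip().str.lower().isin(placeholders).sum())
```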

2.3. Sparseness and Missing information

Sparseness evaluates the percentage of the elements that are not populated.

Note that sometimes there are values that are conceptually missing but are not literally NULL (or None in Python). For example, a column that contains a status code may have a value of "n/a" or "unknown" to indicate that the status is not applicable or not known. Such values should be treated carefully.
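
A small sketch of how both literal and conceptual missingness might be measured (the set of NULL equivalents is an assumption and should be tuned per dataset):

```python
import pandas as pd

status = pd.Series(["active", None, "n/a", "unknown", "inactive"])

# Literal missing values (NULL / NaN / None).
literal_missing = status.isna().mean()

# Conceptually missing values that are not literally NULL.
null_equivalents = {"n/a", "unknown", "not applicable", ""}
hidden_missing = status.dropna().str.strip().str.lower().isin(null_equivalents).mean()

print(f"literal: {literal_missing:.0%}, hidden (of populated values): {hidden_missing:.0%}")
```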

2.4. Uniqueness

Uniqueness indicates if each of the values assigned to the attribute is unique. This often indicates that the attribute is a primary key or a unique identifier for the records in the table.

2.5. Numeric value statistics

Numeric value statistics provide insights into the distribution of numeric values within a column. This includes measures such as maximum and minimum values, average value, quartiles, and standard deviation.

These statistics usually can't be interpreted without an understanding of the semantics of the data. For example, a maximum value of 1000 in a column that represents age is very different from a maximum value of 1000 in a column that represents income. Negative values in a column that represents age may indicate an error, while negative values in a column that represents temperature may be perfectly valid.
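
For completeness, a minimal sketch of these statistics with pandas (the column is hypothetical):

```python
import pandas as pd

income = pd.Series([32_000, 41_500, 38_200, 1_000_000, 45_300])

print(income.describe())                   # count, mean, std, min, quartiles, max
print(income.quantile([0.25, 0.5, 0.75]))  # quartiles explicitly
# A maximum far above the upper quartile (1,000,000 here) is worth a second look,
# but whether it is an error depends entirely on what the column means.
```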

2.6. Value length statistics

Value length statistics provide insights into the length of the values within a column. Usually, we are most interested in the minimum, maximum, and maybe average or median length of the values.

Long values may indicate a text column that contains free-form text, while short values may indicate a categorical column with limited possible values.
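
A quick sketch (hypothetical column):

```python
import pandas as pd

comments = pd.Series(["ok", "needs review", "customer called twice about the invoice"])
lengths = comments.dropna().astype(str).str.len()

print(lengths.min(), lengths.max(), lengths.mean(), lengths.median())
```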

2.7. Value Patterns

Value patterns involve analyzing the structure of the values in a column. This can include identifying patterns such as regular expressions, specific formats, or other structural characteristics.

For example, a column that contains phone numbers may have values that follow a specific pattern, such as "(123) 456-7890" or "123-456-7890".
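
One common technique is to generalize each value into a symbolic pattern (letters to `A`, digits to `9`) and then count the distinct patterns; the sketch below assumes this convention:

```python
import re
import pandas as pd

def value_pattern(value: str) -> str:
    """Generalize a value: letters -> 'A', digits -> '9', other characters kept as-is."""
    pattern = re.sub(r"[A-Za-z]", "A", str(value))
    return re.sub(r"\d", "9", pattern)

phones = pd.Series(["(123) 456-7890", "123-456-7890", "555-0199"])
print(phones.map(value_pattern).value_counts())
# (999) 999-9999    1
# 999-999-9999      1
# 999-9999          1
```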

2.8. Data Type Identification

Data type identification involves determining the data type of the values in a column. This can include identifying numeric, alphanumeric, date, time, and other data types.

This is actually a lot trickier than it may seem, as the number of potentially valid data types is very large. For example, a column that contains dates may have values in different formats, such as "2023-10-01", "01/10/2023", or "October 1, 2023". While such values could be coarsely classified as "date", they may require different parsing strategies to be correctly interpreted.

Further, the data type of a column may not be consistent across all records. For example, a column that is expected to contain numeric values may have some records with non-numeric values, such as "n/a" or "unknown".

This requires the concept of a column type to be quite flexible.

Note that we may want to draw a distinction between the semantic type of the column (e.g. "date", "email", "phone number"), the actual data type of the values in the column (e.g. "string", "integer", "float"), and the format of the values in the column (e.g. "YYYY-MM-DD", "MM/DD/YYYY", "DD Month YYYY"). The semantic type is often determined by the context in which the column is used, while the data type and format are determined by the actual values in the column.
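
A rough, heuristic sketch of type inference by attempted parsing (the threshold and type labels are assumptions; `format="mixed"` requires pandas 2.0 or later):

```python
import pandas as pd

def infer_column_type(values: pd.Series, threshold: float = 0.95) -> str:
    """Heuristic: the first type that parses at least `threshold` of the non-null
    values wins. Values like "n/a" count against the threshold."""
    non_null = values.dropna().astype(str)
    if non_null.empty:
        return "empty"

    as_numeric = pd.to_numeric(non_null, errors="coerce")
    if as_numeric.notna().mean() >= threshold:
        return "integer" if (as_numeric.dropna() % 1 == 0).all() else "float"

    as_datetime = pd.to_datetime(non_null, errors="coerce", format="mixed")
    if as_datetime.notna().mean() >= threshold:
        return "date/time"

    return "string"

print(infer_column_type(pd.Series(["2023-10-01", "01/10/2023", "October 1, 2023"])))  # date/time
print(infer_column_type(pd.Series(["12.5", "7.25", "3.0"])))                          # float
```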

2.9. Text Quality

2.9.1. Likely incomplete values

  • Text values that are literally blank or contain only whitespace characters.
  • Text values that are simply too short to be meaningful, such as a single character or a very short word.
  • Text that contains a single quote ' or double quote " without a matching closing quote.
  • Text that appears to be incomplete, such as a word that is cut off in the middle (e.g. "exampl" instead of "example").
  • Unbalanced delimiters (e.g., {}, [], «», quotes, back-ticks).
  • Text containing a pair of empty parentheses (), empty brackets [], or empty braces {}. The contents might have been stripped away.
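
A minimal sketch that flags a few of the checks above (the thresholds are assumptions; note that the single-quote check is noisy for legitimate apostrophes such as "O'Brien"):

```python
import re

def likely_incomplete(text: str) -> list[str]:
    """Flag a handful of the heuristics above; a sketch, not an exhaustive checker."""
    issues = []
    if not text.strip():
        issues.append("blank or whitespace-only")
    elif len(text.strip()) <= 1:
        issues.append("too short to be meaningful")
    if text.count('"') % 2 or text.count("'") % 2:
        issues.append("unbalanced quote")
    for opener, closer in ("()", "[]", "{}"):
        if text.count(opener) != text.count(closer):
            issues.append(f"unbalanced {opener}{closer}")
    if re.search(r"\(\s*\)|\[\s*\]|\{\s*\}", text):
        issues.append("empty parentheses/brackets/braces")
    return issues

print(likely_incomplete("John Smith ("))   # ['unbalanced ()']
print(likely_incomplete("see note ()"))    # ['empty parentheses/brackets/braces']
```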

2.9.2. Spaces

Spaces analysis involves looking at leading and trailing spaces, as well as the maximum number of spaces between words.

This is primarily a standardization / data quality issue. Unnecessary spaces can lead to parsing errors and lookup issues.

2.9.3. Casing and Character Sets

The analysis of casing and character sets involves looking at upper/lower casing, as well as foreign, non-alphanumeric, and non-UTF-8 characters.

  • Uppercase/Lowercase distribution.
  • Unexpected special symbols (e.g. ASCII control characters or invalid unicode).
  • Unexpected unicode in pure ASCII text fields.
  • Zero-width unicode characters (ZWSP, ZWJ, ZWNJ, & hair-spaces) used for watermarking or steganography
  • Homoglyphs and easily confusable characters (раypal.com using Cyrillic “а”)

2.9.4. Numeric Formatting

Beyond just the scale of values, numeric formatting involves looking at the number of decimal places, the use of commas or other separators, and the overall structure of numeric values.

2.9.5. Statistically unlikely patterns

Depending on the semantics of the data, there may be patterns that are statistically unlikely to occur. For example, you would not expect a lot of uncommon punctuation characters in names or addresses.

  • Unusual characters in names and addresses: {}[]_%#@^<>~?!.
  • Unusual punctuation characters at the beginning or end of a value, such as @example.com or #12345.
  • Unusual character sequences, such as "???", ",,,", or "...".
  • $ not followed by digits: May indicate invalid currency formatting. (Ex: Cost: $abc)
  • Uncommon Unicode characters, such as emojis or special symbols.
  • Excessive use of certain characters, such as multiple consecutive dashes or underscores.
  • Text with digits immediately followed by a letter.
  • Math and logic symbols (≈ ± ∑ ÷ ⇒) inside names or addresses.
  • Lowercase followed by uppercase, e.g. "dH"
  • Gibberish / hashed blobs. Can be detected via entropy.
  • Text that appears to be a URL or email address, but does not follow the expected format (e.g. example.com without a protocol or user@domain without a domain tld).
  • Rare token ratios (vowels : consonants, digit density)
  • Pronounceability / vowel-to-consonant ratio: a vowel-to-consonant ratio below roughly 20% in a “person name” column tends to indicate unpronounceable strings like “Brrzxt” that are almost always junk ([Linguistics Stack Exchange](https://linguistics.stackexchange.com/questions/28908/list-of-vowel-and-consonant-sequences-across-languages)). A quick estimate: `len(re.findall('[aeiou]', x, re.I)) / len(x)`.
  • Only stop-words or punctuation (“and”, “the”, “.”)
  • Text starting with punctuation (exclamation mark, comma, dash, period, backslash, forward slash, colon, semicolon, asterisk, etc).
  • Hanging comma, e.g. "Jon Doe,"
  • Quoted values (starting and ending with a quote). Most likely indicates some kind of conversion issue.

2.9.6. Security-oriented patterns

Security-oriented patterns involve looking for patterns that may indicate potential security issues, such as SQL injection attempts, cross-site scripting (XSS) attacks, or other malicious input.

Some common patterns to look for include:

  • SQL injection tokens, such as ' OR 1=1, UNION SELECT, or --.
  • Cross-site scripting (XSS) tokens, such as <script>, javascript:, or onerror=.
  • HTML tags, such as <img>, <a>, or <iframe>.
  • CRLF or LF injection tokens, such as %0d%0a or %0a.
  • Null-byte injection tokens, such as %00 or \0, as well as raw null unicode bytes.
  • Path traversal tokens, such as ../, ..\\, or ..%2f.
  • Repeating invisible watermark patterns, such as zero-width space characters (ZWSP) every 5 characters.
  • Server-side include directives, such as <!--#exec cmd="ls"-->.

2.9.7. Database-specific issues

Strings at the limits of allowed text lengths for columns may indicate truncation issues. For example, if a column is defined as VARCHAR(50), and a significant number of values are close to 50 characters long (or there is an anomalous spike at 50 characters), this may indicate that the data is being truncated when inserted into the database.
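
A rough way to surface truncation suspects (the spike threshold is an arbitrary assumption):

```python
import pandas as pd

def truncation_suspect(values: pd.Series, declared_limit: int, spike_ratio: float = 0.05) -> bool:
    """Flag a column if an anomalously large share of values sit exactly at the
    declared maximum length (e.g. VARCHAR(50))."""
    lengths = values.dropna().astype(str).str.len()
    if lengths.empty:
        return False
    return (lengths == declared_limit).mean() >= spike_ratio

comments = pd.Series(["short note", "x" * 50, "y" * 50, "another note"])
print(truncation_suspect(comments, declared_limit=50))  # True: half the values sit at the limit
```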

2.10. Date and Time Quality

Plausible era checks verify that date values fall within a sensible range; for example, a birthdate must yield an age of 0–120 years ([CONTENTdm](https://cdm20045.contentdm.oclc.org/customizations/collection/p20045coll26/pages/CensusQuality/Test-IBVAgepublic.html)).
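
A minimal sketch of such a plausibility check (the 0–120 range follows the rule above; the column is hypothetical):

```python
import pandas as pd

birthdates = pd.to_datetime(
    pd.Series(["1985-03-12", "2150-01-01", "1890-07-04"]), errors="coerce"
)
age_years = (pd.Timestamp.today() - birthdates).dt.days / 365.25

implausible = ~age_years.between(0, 120)
print(birthdates[implausible])  # the future date (negative age) and the 1890 date (age > 120)
```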

3. Cross-column Profiling

Cross-column profiling analyzes dependencies among sets of data attributes within the same table. It involves correlation analysis, key analysis, dependency analysis, and the determination of unique and distinct counts. This type of profiling aids in the identification of duplicate records and helps in the determination of primary keys and functional dependencies.

3.1. Dependency Analysis

Functional dependency analysis determines if there are embedded relationships between columns in a table. A functional dependency between column X and column Y says that, given any two records, if the corresponding values of column X are the same, then the corresponding values of column Y will be the same. This implies that column Y could be expressed as a function of column X. This abstraction can be extended to sets of columns as well.

Functional dependency analysis can be used for identification of redundant data and to determine if a column is a candidate for a primary key (all columns except the primary key must be functionally dependent on the primary key).

Examples of functional dependencies:

  • If a person has a unique social security number, then the social security number can be used to determine the person's name, date of birth, and other personal information.
  • If a product has a unique SKU (Stock Keeping Unit), then the SKU can be used to determine the product's name, price, and other attributes.
  • Zip code can be used to determine the city and state.
  • Height and weight can be used to determine the body mass index (BMI).
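
A small sketch of how X → Y could be tested on a table (assuming a pandas DataFrame; the zip/city example mirrors the list above):

```python
import pandas as pd

def fd_holds(df: pd.DataFrame, determinant: list[str], dependent: str) -> bool:
    """X -> Y holds iff every distinct value of X maps to exactly one value of Y."""
    return bool((df.groupby(determinant, dropna=False)[dependent].nunique(dropna=False) <= 1).all())

addresses = pd.DataFrame({
    "zip":   ["10001", "10001", "94105"],
    "city":  ["New York", "New York", "San Francisco"],
    "state": ["NY", "NY", "CA"],
})
print(fd_holds(addresses, ["zip"], "city"))   # True: zip -> city on this sample
print(fd_holds(addresses, ["state"], "zip"))  # True here, but would break with more NY zip codes
```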

3.1.1. Approximate Functional Dependencies

Approximate functional dependencies are a relaxed version of functional dependencies that hold for a significant portion of records (e.g. 95–99%), but not all.
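
The satisfaction rate of a candidate dependency can be estimated as the share of rows that agree with the most frequent dependent value within each determinant group; a sketch:

```python
import pandas as pd

def afd_satisfaction(df: pd.DataFrame, determinant: list[str], dependent: str) -> float:
    """Fraction of rows consistent with X -> Y: within each X group, rows carrying
    the group's most frequent Y value count as 'consistent'."""
    agreeing = df.groupby(determinant)[dependent].agg(lambda s: s.value_counts().iloc[0]).sum()
    return agreeing / len(df)

addresses = pd.DataFrame({
    "zip":  ["10001", "10001", "10001", "94105"],
    "city": ["New York", "New York", "NYC", "San Francisco"],
})
print(afd_satisfaction(addresses, ["zip"], "city"))  # 0.75: one row deviates from zip -> city
```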

3.2. Primary Key Analysis

Primary Key analysis identifies potential primary keys in a table. Primary keys are unique identifiers (in general, combinations of columns) for records in a table, and they usually must satisfy the following conditions:

  • Uniqueness: Each value in the primary key must be unique across all records.
  • Non-nullability: The primary key cannot contain NULL values.
  • Minimality: The primary key should be as small as possible while still uniquely identifying each record.

The issue is that often you have multiple columns that can be used as a primary key, e.g. a combination of first name, last name, and date of birth can uniquely identify a person, but so can a social security number or a driver's license number.
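
A brute-force sketch that searches for minimal unique, fully populated column combinations (fine for narrow tables, exponential in the worst case; the table and column names are hypothetical):

```python
import pandas as pd
from itertools import combinations

def candidate_keys(df: pd.DataFrame, max_columns: int = 2) -> list[tuple]:
    """Return minimal column combinations that are unique and contain no NULLs."""
    found: list[tuple] = []
    for size in range(1, max_columns + 1):
        for cols in combinations(df.columns, size):
            if any(set(prev).issubset(cols) for prev in found):
                continue  # not minimal: a smaller key is already contained in this combination
            subset = df[list(cols)]
            if subset.notna().all().all() and not subset.duplicated().any():
                found.append(cols)
    return found

people = pd.DataFrame({
    "ssn":        ["111", "222", "333"],
    "first_name": ["Ann", "Ann", "Bob"],
    "last_name":  ["Lee", "Kim", "Lee"],
})
print(candidate_keys(people))  # [('ssn',), ('first_name', 'last_name')]
```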

3.3. Correlation Analysis

Correlation analysis evaluates the degree of association between two attributes. This strictly speaking doesn't have to be a simple linear correlation, but can also include more complex relationships (e.g. Kendall's Tau, Mutual Information, etc.).

This is different from functional dependency analysis, as correlation does not imply a functional relationship. For example, two columns may be correlated but not functionally dependent on each other: height and weight of a person are correlated, but knowing the height does not allow you to determine the weight.

Strictly speaking, correlation analysis can be performed on any two columns, not just numeric ones. For example, there is likely a "correlation" between the first name and the year of birth of a person, as certain names are more common in certain generations.
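
A short sketch; Pearson and Kendall coefficients come straight from pandas, and for categorical pairs something like `sklearn.metrics.mutual_info_score` on label-encoded columns is one option (mentioned as a suggestion, not a prescription):

```python
import pandas as pd

measurements = pd.DataFrame({
    "height_cm": [155, 160, 170, 180, 190],
    "weight_kg": [52, 58, 70, 80, 95],
})

print(measurements.corr(method="pearson"))   # linear correlation
print(measurements.corr(method="kendall"))   # rank-based association (Kendall's tau)
```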

3.4. Monotonicity / Sorting Analysis

Often a column is sorted in a specific order, such as ascending or descending. This technically is an example of an ordering relationship between the index of a row and the value of a column.

3.5. Conditional Relationships

Conditional relationships are relationships between columns that depend on the values of other columns.

3.5.1. Overloading

Overloading attempts to determine if an attribute (column) is being used for multiple purposes. This occurs when a single column contains values that represent different types of information, often depending on context or other columns. For example, a column might sometimes contain a phone number and other times an email address, or a status column might encode both state and reason in a single value.

Detecting overloading typically involves analyzing the patterns, formats, and distributions of values within a column to identify inconsistencies or multiple distinct value types. Overloaded columns can complicate data processing, validation, and analysis, and are generally considered a data quality issue.

4. Cross-table Profiling

Cross-table profiling evaluates relationships across tables. This involves foreign key analysis, identification of orphaned records, identification of redundant data, and determination of semantic and syntactic differences between tables.

Cross-table profiling iteratively reviews how the values within column sets in different tables intersect and overlap. Reviewing the cardinality of the columns and the degree to which their value sets overlap suggests dependencies across tables, relationships, and redundant storage, as well as opportunities for identifying data value sets that are mapped together. Some key capabilities for cross-table profiling include:

4.1. Inclusion dependencies

Inclusion dependencies are a type of relationship between two tables where the values in a column (or set of columns) in one table must exist in a column (or set of columns) in another table. This is often used to enforce referential integrity between tables, such as foreign key constraints.

For example, if a table of orders has a column for customer ID, and that customer ID must exist in a table of customers, then there is an inclusion dependency between the two tables.

Inclusion dependency is a practical way to identify foreign key relationships (though it is not a sufficient condition for a foreign key).
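
A sketch of measuring how well an inclusion dependency holds (table and column names are hypothetical):

```python
import pandas as pd

orders    = pd.DataFrame({"customer_id": [1, 2, 2, 4]})
customers = pd.DataFrame({"customer_id": [1, 2, 3]})

# Share of order rows whose customer_id also appears in the customers table.
coverage = orders["customer_id"].isin(customers["customer_id"]).mean()
print(coverage)  # 0.75 -> the dependency is violated by customer_id 4
```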

4.2. Domain-Overlap Analysis

Apart from analyzing simple inclusion, we may also look for columns that have a significant overlap in values, even if they are not strict subsets of each other.
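
Jaccard similarity between the distinct value sets of two columns is one simple overlap measure; a tiny sketch with hypothetical value sets:

```python
# Distinct values of two hypothetical columns from different tables.
order_ids    = {1, 2, 4}
customer_ids = {1, 2, 3}

jaccard = len(order_ids & customer_ids) / len(order_ids | customer_ids)
print(jaccard)  # 0.5: neither column is a subset of the other, but they overlap substantially
```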

4.3. Foreign Key Analysis

Foreign key analysis identifies relationships between tables. Foreign keys are columns (or sets of columns) in one table that reference the primary key of another table.

The conditions for a foreign key relationship are:

  • The foreign key column(s) in the child table must reference the primary key column(s) in the parent table.
  • The foreign key column(s) must not contain values that do not exist in the primary key column(s) of the parent table.
  • The foreign key column(s) may contain NULL values.

Sometimes there are orphaned records: child records that reference a parent record that does not exist. This can happen if the parent record was deleted or if the foreign key constraint was not enforced. In such cases, we have to identify the FK relationship approximately, as the inclusion dependency won't strictly hold.
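
Orphans can be surfaced with a left anti-join; a sketch with hypothetical tables:

```python
import pandas as pd

customers = pd.DataFrame({"customer_id": [1, 2, 3]})
orders    = pd.DataFrame({"order_id": [10, 11, 12], "customer_id": [1, 2, 99]})

# Left anti-join: order rows whose customer_id has no match in customers.
merged  = orders.merge(customers, on="customer_id", how="left", indicator=True)
orphans = merged[merged["_merge"] == "left_only"]
print(orphans[["order_id", "customer_id"]])  # order 12 references the missing customer 99
```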

4.4. Redundancy Analysis

Once the foreign key relationships are identified, we can perform redundancy analysis. Often due to denormalization, we have multiple tables that store the same data, or the same data is stored in different columns across tables. For example, a customer table may have a column for customer credit card number, and an orders table may also have a column for customer credit card number.