JOB AID Basic Data Mining and Analysis:
Data Integrity, Description, and Anomaly Detection
Health care enterprises process large volumes of data yet may have problems transforming data into actionable
management information and business intelligence. Health care providers and other professionals also face a
breathtaking array of data mining and analysis methods, tools, products, and services. As a result, even professionals who understand that data mining and analysis help establish better controls, operationalize the “sentinel” effect, and demonstrate commitment to “doing the right thing right” can find it hard to know where to start.
Given program integrity’s general objective to identify “what is not right,” understanding what anomalies are and the basics of how to find them is a good place to start. In general, this ordered process should be a good first step.
1. Characterize the Question

Before considering “what data do I need,” “where will I get it,” and “what will I do with it,” write down the question(s) you want to answer and then classify them by type. The classification scheme used by auditors, discussed below, can be quite helpful. It reveals how reviewers and auditors often structure their work and corresponds to the general flow of the data analysis process.
Audit questions fall into three general categories—descriptive, normative, or cause-and-effect:
• Descriptive: Provides descriptive information about specific conditions of a program or activity;
• Normative: Compares an observed outcome to what is expected; and
• Cause-and-effect: Determines if observed conditions, events, or outcomes can be attributed to the operation of the program or activity.
These categories also correspond to the number of variables typically analyzed:
• Descriptive analysis usually works with one variable at a time; that is, it is “univariate”;
• Normative analysis usually works with two variables—the “norm” and “what is”—and is, therefore, usually “bivariate”; and
• Cause-and-effect analysis usually works with more than two variables and, as such, is “multivariate.”

As we shall see, classifying question types and the analyses used to answer them also helps one choose which tool(s) to use. These three categories lay the foundation for the general process of data analysis: describe it, then norm it, and then try to predict it. This process moves from relative simplicity toward greater complexity, as shown in Figure 1.
Figure 1. General Data Analysis Model

Characterizing and documenting analytical questions help specify how and from where to collect the appropriate data, a topic not addressed in this document.
2. Control the Data

Good data analysis starts with good data. Even the best analytical tools cannot cure invalid or unreliable data. Data validity and reliability are bolstered by good information security (IS), including:
• Good physical access controls, such as controlled and task- and staff-specific access to particular applications and administrative functions;
• Policies and procedures on tailgating, unrecognized persons, suspicious activity, workstation information control, and technology use while traveling and telecommuting;
• Means to identify and address cyber crimes such as identity theft, credit card abuse, spam, malware, hoaxes, cookies, ActiveX® applications, and phishing;
• Appropriate control over personal electronics and software at the workplace; and
• Firewalls, anti-virus and encryption software, and file back-up and retention.
Health care data control includes protecting privacy using controls like:
• Accessing, using, and sharing information only on a need-to-know basis;
• Storing, transferring, and transmitting personal identifying information (PII) only by encrypted means;
• Using Social Security numbers (SSNs) and health care identification numbers only at the time needed and never printing them in their entirety;
• Shredding, rather than merely discarding, documents containing PII; and
• Reporting and fully disposing of potential or actual privacy breaches and enacting appropriate disciplinary measures.

It can be helpful to discuss your business, data and information needs, and technology infrastructure with your medical practice management software vendor to identify material threats, vulnerabilities, and risks.
3. Know Your System

The efficiency and effectiveness of data mining and analysis directly relate to the quality of the system used, electronic or otherwise. Finding, understanding, and controlling anomalies are the foundation of program integrity and require a measurement system that can:
• Detect and show small changes (resolution);
• Respond to change in equal, constant, or appropriate ways (linearity);
• Specify differences between high-side (upscale) and low-side (downscale) values (hysteresis);
• Be consistent with like systems (difference among like gauges or like configurations); and
• Reveal variances in process, policy, practice, or people (difference among operations or operators).
4. Know Your Limits

Before “crunching” data, determine the capacity and capability of the software used for mining and analysis to avoid “hitting the wall” in the middle of an analysis project. On the data capacity side, find out about the:
• Amount of computer memory required to run the software smoothly;
• Maximum number of available fields (columns) of data;
• Maximum number of available records (rows) of data; and
• Maximum number of characters available in a cell or mathematical formula.
Regarding capability, learn about:
• Any special demands on, or requirements of, your operating system;
• What kinds of data (text and numbers) and file formats the software can handle;
• How easy or difficult it is to load, administer, change, and save files;
• Whether the tools needed are easy to reach without a lot of keystrokes or menu drill-downs;
• Whether basic arithmetic and graphs are easy to do;
• The types and numbers of formulas the software includes;
• How to get product updates and assistance; and
• The depth of detail and usefulness of the Help feature.
Be familiar with the existing practice management technology, especially before investing in new hardware or software. Find out whether special or one-off analyses or reports are better created by customizing current software or by exporting data to, and using, another application.
5. Know Your Data
Before mining or analyzing data, be sure to:
• Understand what each field (column) and record (row) contains and what it means, using a data dictionary, vetting against the data source, or other suitable means;
• Include a unique row counter, preferably in the far left-hand column;
• Save the original data set as a separate file and then work from a copy of that file;
• Formulate and document the question(s) you want to answer; and
• Delete irrelevant fields or records to make the file more manageable.
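As a minimal sketch of these preparation steps (the file name, field names, and sample records are illustrative assumptions, not a prescribed layout), Python’s standard library can stage a working copy, add a unique row counter, and drop an irrelevant field:

```python
import csv
from pathlib import Path

# Illustrative working copy of an extract; in practice, save the original
# data set separately and work only from a copy like this one.
Path("claims_copy.csv").write_text(
    "claim_no,amount,notes\nA100,25.00,ok\nA101,40.00,resubmit\n"
)

with open("claims_copy.csv", newline="") as f:
    rows = list(csv.DictReader(f))

IRRELEVANT = {"notes"}  # assumed irrelevant field; vet against your data dictionary

# Add a unique row counter as the left-most field and drop irrelevant fields.
cleaned = [
    {"row_id": i, **{k: v for k, v in r.items() if k not in IRRELEVANT}}
    for i, r in enumerate(rows, start=1)
]
print(cleaned[0])  # {'row_id': 1, 'claim_no': 'A100', 'amount': '25.00'}
```

The row counter makes it possible to restore the original sort order after any amount of re-sorting during analysis.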
6. Assess Data Quality

What is “good” data? For auditors, data and the evidence they yield must be:
• Sufficient—Is there enough data to persuade a knowledgeable person that the analysis and its results are reasonable?
• Relevant—Does the data have a logical relationship with, and importance to, the issue being addressed?
• Valid—Does the data give a meaningful, reasonable basis for measuring what is being evaluated?
• Reliable—Will the data and related analysis provide consistent results when information is measured or tested, and are they verifiable or supported?

The quality of data mining and analysis can be enhanced by using multiple data analysis methods to help offset the weaknesses inherent in viewing or analyzing something in only one way. For example, interview evidence is more credible if supported by physical or documentary evidence. Information from independent external sources is generally more reliable than information from a single internal source.
7. Assess Data Integrity

The basic data integrity procedures below apply to any data set. Remember that data errors can exist but not matter, though they might point to control issues beyond the given analytical context. The importance of each data integrity issue hinges on several key questions:
• Does the error relate to this analysis or this question?
• Is this relationship material; that is, does it matter?
• Is this relationship significant; that is, does it matter enough to warrant seeking other data or changing or abandoning the analysis or the question?
• Does this type of error raise other important or future questions?
More specifically, inspect each column of data that matters to your analysis and look for:
• Blanks—Some fields, like claim numbers or patient identification numbers, should not be blank.
• Zeroes—Some zeroes are appropriate, others are not, particularly if a zero is a proxy for a blank.
• Error Values—Entries like #N/A, #REF!, #NUM!, and #NULL! can be inappropriate or indicate that data actually exists in the original source file but a calculation failed or data did not migrate.
• Unprintable Characters—On-screen data sometimes contain apostrophes, dashes, carets, or other characters that do not print but prevent data from functioning properly, especially when the data came from another (mainframe or legacy) computer application.
• Unnecessary Spaces—Spaces appearing before the first or after the last character in a cell can be residues of tabs, returns, or other commands that make even basic arithmetic impossible.
• Numbers Formatted as Text—Quantities imported as text often fail to calculate properly.
• Duplicates—Depending on the data extract, some fields and data values, such as claim numbers or dates, should not duplicate.
• Edits—An overabundance of edits, corrections, and adjustments can highlight control or education issues or attempts to “cover one’s tracks.”
• Foreign Items—Data that are imported but were not actually requested or part of the data query are a possible indication of misunderstanding or a broader software failure.
• Unreasonable Values—This includes things like dates in the next century, ten-digit SSNs, two-digit Current Procedural Terminology (CPT) codes, numbers in names, a million-dollar copayment, etc.

Sorting a column ascending and then descending (or vice versa) often reveals such data errors.
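Several of these checks can be automated. The sketch below scans a single column for blanks, error values, stray spaces, numbers formatted as text, zeroes, duplicates, and unreasonable values; the sample data and the million-dollar threshold are illustrative assumptions, not a standard rule set:

```python
# Known spreadsheet error strings that may indicate failed calculations
# or data that did not migrate from the source file.
ERROR_VALUES = {"#N/A", "#REF!", "#NUM!", "#NULL!"}

# Illustrative copayment column; real data would come from your extract.
copayments = ["25.00", "", "#REF!", " 40.00", "0", "1000000", "abc", "25.00"]

findings = []
seen = set()
for i, cell in enumerate(copayments):
    if cell == "":
        findings.append((i, "blank"))
        continue
    if cell in ERROR_VALUES:
        findings.append((i, "error value"))
        continue
    if cell != cell.strip():
        findings.append((i, "unnecessary spaces"))
    try:
        value = float(cell)
    except ValueError:
        findings.append((i, "number formatted as text or foreign item"))
        continue
    if value == 0:
        findings.append((i, "zero (possible proxy for blank)"))
    if value >= 1_000_000:  # assumed "unreasonable" threshold for a copayment
        findings.append((i, "unreasonable value"))
    if cell in seen:
        findings.append((i, "duplicate"))
    seen.add(cell)

for row, issue in findings:
    print(f"row {row}: {issue}")
```

Whether a flagged item actually matters still depends on the materiality and significance questions above; the scan only surfaces candidates for review.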
8. Know the Data Type

Before choosing tools, set ground rules to help match data analysis method and purpose. Tool choice is driven by data type. Misinterpreting data type can result in inappropriate methods and create invalid, inaccurate, ineffective, or incorrect results.

Three general types of data exist:
• Nominal data, or “attributes,” use names, categories, or labels for qualitative values, such as gender, ethnicity, job title, etc.
• Ordinal data, or “ranks,” are also categorical variables, but the order of the categories has meaning, as in surveys using an ordinal scale ranging from “poor” to “excellent.” Such categories are often converted to numbers (4, 3, 2, and 1) for further analysis.
• Interval variables, usually called just “variables,” are true numbers, like dollar amounts or age. Unlike nominal and ordinal data, which do not assert degree (one cannot say “one person is three times more male than another” or “person A said this training was five times more excellent than person B”), interval variables have meaningful value differences and allow statements about extent or degree.

Program integrity data analysis usually involves only attributes, which answer questions about compliance and control (was it done right, “yes” or “no”), and variables, which answer questions about volume and value (how many claims were done right), as distinguished in Table 1.
Table 1. How Attributes and Variables Differ

Attributes:
• Answer “yes/no” questions—“Are you male?”
• Cannot assert degree—“I am twice as male.”
• Can have only two values—“Yes” or “No.”

Variables:
• Answer numeric questions—“How many males?”
• Can assert degree—“I am twice as old.”
• Can have an infinite number of possible values.
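The distinction in Table 1 determines which summary is meaningful. In the sketch below (claim records are made up for illustration), the attribute answers a yes/no compliance question by counting, while the variable answers a how-much question by averaging:

```python
# Made-up claim records: "paid_correctly" is an attribute, "amount" a variable.
claims = [
    {"paid_correctly": "yes", "amount": 120.00},
    {"paid_correctly": "no",  "amount": 45.50},
    {"paid_correctly": "yes", "amount": 80.00},
]

# Attribute question (compliance and control): how many claims were done right?
done_right = sum(1 for c in claims if c["paid_correctly"] == "yes")

# Variable question (volume and value): what is the average claim amount?
average_amount = sum(c["amount"] for c in claims) / len(claims)

print(done_right, round(average_amount, 2))  # 2 81.83
```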
9. Describe the Data

There are two general types of descriptive statistics:
• Measures of central tendency—Where data tend to fall; and
• Measures of spread—How spread out or concentrated the data are.

Three central tendency measures are common, and each works best with a given data type:
• Mean (average value)—Best measure of a variable (quantity); for example, how much did the patient pay for each visit last year, on average?
• Mode (most frequent value)—Best measure of an attribute (quality); for example, is the patient on Medicaid?
• Median (middle value)—Best measure of a rank; for example, does the patient rate services as excellent, good, fair, or poor?

Common measures of data spread include:[8, 9]
• Range—Difference between the largest and smallest values.
• Interquartile range—Difference between the 75th and 25th percentiles.
• Standard deviation—Square root of the average of the squared differences between each datum and the mean; on the bell curve, about 34 percent of the data fall between the mean and one standard deviation.
• Skew—Measures whether data are symmetrical to the left and right of center; zero on the bell curve.
• Kurtosis—Measures whether the data are peaked or flat relative to a normal distribution; zero on the bell curve.
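Most of these measures are available in, or easily built from, Python’s standard statistics module. The payment values below are illustrative, and the skew formula is a simple moment-based sketch rather than a particular software package’s definition:

```python
import statistics

# Illustrative per-visit payments; the 120.0 outlier pulls the mean upward.
payments = [20.0, 25.0, 25.0, 30.0, 40.0, 45.0, 120.0]

mean = statistics.mean(payments)      # average value: best for variables
median = statistics.median(payments)  # middle value: best for ranks
mode = statistics.mode(payments)      # most frequent value: best for attributes

data_range = max(payments) - min(payments)       # largest minus smallest
q1, _, q3 = statistics.quantiles(payments, n=4)  # 25th and 75th percentiles
iqr = q3 - q1                                    # interquartile range
stdev = statistics.pstdev(payments)              # population standard deviation

# Simple moment-based skew; zero for a perfectly symmetric distribution.
skew = sum((x - mean) ** 3 for x in payments) / (len(payments) * stdev ** 3)

print(median, mode, data_range, iqr)  # 30.0 25.0 100.0 20.0
```

Note how the outlier makes the mean (about 43.57) much larger than the median (30.0) and the skew positive; comparing these measures is itself a quick anomaly check.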