Table of Contents

Learning Tabulify - Step 10 - Data Diff, a step by step

About

Data Diff is the cornerstone of every development because it validates the processing of data.

Tabulify ships with a Data diff operation that allows you to compare:

The diff operation (comparison) is implemented with the data diff command.

Data to compare

For the sake of simplicity, we will compare two csv data resources but you may compare any content data resources such as a SQL table or SQL query

last_name,first_name,birth_date,wikipedia_page
Schwarzenegger,Arnold,1947-07-30,https://en.wikipedia.org/wiki/Arnold_Schwarzenegger
Norman,Don,1935-12-25,https://en.wikipedia.org/wiki/Don_Norman
Harbison Carnagey,Dale,1888-11-24,https://en.wikipedia.org/wiki/Dale_Carnegie
Kahneman,Daniel,1934-03-05,https://en.wikipedia.org/wiki/Daniel_Kahneman
Pittman,Bob,1953-12-28,https://en.wikipedia.org/wiki/Robert_Pittman_(media_executive)


Last Name,First Name,Birth Date,Wikipedia Page
Schwarzenegger,Arnold,1947-07-30,https://en.wikipedia.org/wiki/Arnold_Schwarzenegger
Norman,Don,1935-12-23,https://en.wikipedia.org/wiki/Don_Norman
Kahneman,Daniel,1934-03-05,https://en.wikipedia.org/wiki/Daniel_Kahneman


The fact that a record is missing can be easily spotted but there is a more subtle differences.

Failure Behavior

By default, a diff that detects differences will fail (ie return a bad exit code).

The Summary report

All diff can be executed with the data diff command by giving as argument the source and target data resource.

tabul data diff \
  --no-fail \
  diff/characters_original.csv@howto diff/characters_new.csv@howto
Diff Summary Report
The data resources are not equals.
from                                 to                              equals   record_count   change_count
----------------------------------   -----------------------------   ------   ------------   ------------
diff/characters_original.csv@howto   diff/characters_new.csv@howto                       5              6

The default report is to show a high level report that lists the source and target resources and the result of the comparison:

There is a total of:

It doesn't seem quite right, let's take a look at it by asking a unified report to see the difference in details.

The Unified report

To get a diff at the record level, you set the report-type option to the unified value.

tabul data diff \
    --no-fail \
    --report-type unified \
    diff/characters_original.csv@howto diff/characters_new.csv@howto
Unified Diff Report between the resources (diff/characters_original.csv@howto) and (diff/characters_new.csv@howto).
 - �[31mred�[0m   for deleted
 - �[32mgreen�[0m for added
 - �[34mblue�[0m  for type change (example: number as varchar)
The data resources are not equals.
diff_status   last_name           first_name   birth_date   wikipedia_page
-----------   -----------------   ----------   ----------   --------------------------------------------------------------
=             Schwarzenegger      Arnold       1947-07-30   https://en.wikipedia.org/wiki/Arnold_Schwarzenegger
�[32m+             �[0m�[32mNorman              �[0m�[32mDon          �[0m�[32m1935-12-23   �[0m�[32mhttps://en.wikipedia.org/wiki/Don_Norman�[0m
�[32m+             �[0m�[32mKahneman            �[0m�[32mDaniel       �[0m�[32m1934-03-05   �[0m�[32mhttps://en.wikipedia.org/wiki/Daniel_Kahneman�[0m
�[31m-             �[0m�[31mNorman              �[0m�[31mDon          �[0m�[31m1935-12-25   �[0m�[31mhttps://en.wikipedia.org/wiki/Don_Norman�[0m
�[31m-             �[0m�[31mHarbison Carnagey   �[0m�[31mDale         �[0m�[31m1888-11-24   �[0m�[31mhttps://en.wikipedia.org/wiki/Dale_Carnegie�[0m
�[31m-             �[0m�[31mKahneman            �[0m�[31mDaniel       �[0m�[31m1934-03-05   �[0m�[31mhttps://en.wikipedia.org/wiki/Daniel_Kahneman�[0m
�[31m-             �[0m�[31mPittman             �[0m�[31mBob          �[0m�[31m1953-12-28   �[0m�[31mhttps://en.wikipedia.org/wiki/Robert_Pittman_(media_executive)�[0m

This report will shows you how to get from:

It tells us that we need to:

from the original resource to obtain the new resource

It's a total of 6 changes as seen previously in the summary report.

But why does the diff did not compare the Norman record to each other ?

This is because their is no known identifier for the record, therefore the diff was executed by record hash and not by last name.

The next step will show you how to define the unique columns to drive a diff comparison.

The Unified Report with Driver Columns Definition

To improve our comparison, we will define the unique column of our data resource with the --driver-columns option.

tabul data diff \
   --no-fail \
   --report-type unified \
   --driver-columns "Last Name" \
   diff/characters_original.csv@howto diff/characters_new.csv@howto
Unified Diff Report between the resources (diff/characters_original.csv@howto) and (diff/characters_new.csv@howto).
 - �[31mred�[0m   for deleted
 - �[32mgreen�[0m for added
 - �[34mblue�[0m  for type change (example: number as varchar)
The data resources are not equals.
diff_status   last_name           first_name   birth_date   wikipedia_page
-----------   -----------------   ----------   ----------   --------------------------------------------------------------
=             Schwarzenegger      Arnold       1947-07-30   https://en.wikipedia.org/wiki/Arnold_Schwarzenegger
�[32m+1            �[0mNorman              Don          �[32m1935-12-23   �[0mhttps://en.wikipedia.org/wiki/Don_Norman
�[31m-1            �[0mNorman              Don          �[31m1935-12-25   �[0mhttps://en.wikipedia.org/wiki/Don_Norman
�[31m-             �[0m�[31mHarbison Carnagey   �[0m�[31mDale         �[0m�[31m1888-11-24   �[0m�[31mhttps://en.wikipedia.org/wiki/Dale_Carnegie�[0m
=             Kahneman            Daniel       1934-03-05   https://en.wikipedia.org/wiki/Daniel_Kahneman
�[31m-             �[0m�[31mPittman             �[0m�[31mBob          �[0m�[31m1953-12-28   �[0m�[31mhttps://en.wikipedia.org/wiki/Robert_Pittman_(media_executive)�[0m

This time, the unified report tell us that we need to:

from the original resource to obtain the new resource

Conclusion

This page ends the learning guide of Tabulify at the command line with Tabul where we have learned to perform data operation one command at a time.