Data Diff is the cornerstone of every development because it validates the processing of data.
Tabulify ships with a Data diff operation that allows you to compare:
The diff operation (comparison) is implemented with the data diff command.
For the sake of simplicity, we will compare two csv data resources but you may compare any content data resources such as a SQL table or SQL query
last_name,first_name,birth_date,wikipedia_page
Schwarzenegger,Arnold,1947-07-30,https://en.wikipedia.org/wiki/Arnold_Schwarzenegger
Norman,Don,1935-12-25,https://en.wikipedia.org/wiki/Don_Norman
Harbison Carnagey,Dale,1888-11-24,https://en.wikipedia.org/wiki/Dale_Carnegie
Kahneman,Daniel,1934-03-05,https://en.wikipedia.org/wiki/Daniel_Kahneman
Pittman,Bob,1953-12-28,https://en.wikipedia.org/wiki/Robert_Pittman_(media_executive)
Last Name,First Name,Birth Date,Wikipedia Page
Schwarzenegger,Arnold,1947-07-30,https://en.wikipedia.org/wiki/Arnold_Schwarzenegger
Norman,Don,1935-12-23,https://en.wikipedia.org/wiki/Don_Norman
Kahneman,Daniel,1934-03-05,https://en.wikipedia.org/wiki/Daniel_Kahneman
The fact that a record is missing can be easily spotted but there is a more subtle differences.
By default, a diff that detects differences will fail (ie return a bad exit code).
All diff can be executed with the data diff command by giving as argument the source and target data resource.
tabul data diff \
--no-fail \
diff/characters_original.csv@howto diff/characters_new.csv@howto
Diff Summary Report
The data resources are not equals.
from to equals record_count change_count
---------------------------------- ----------------------------- ------ ------------ ------------
diff/characters_original.csv@howto diff/characters_new.csv@howto 5 6
The default report is to show a high level report that lists the source and target resources and the result of the comparison:
There is a total of:
It doesn't seem quite right, let's take a look at it by asking a unified report to see the difference in details.
To get a diff at the record level, you set the report-type option to the unified value.
tabul data diff \
--no-fail \
--report-type unified \
diff/characters_original.csv@howto diff/characters_new.csv@howto
Unified Diff Report between the resources (diff/characters_original.csv@howto) and (diff/characters_new.csv@howto).
- �[31mred�[0m for deleted
- �[32mgreen�[0m for added
- �[34mblue�[0m for type change (example: number as varchar)
The data resources are not equals.
diff_status last_name first_name birth_date wikipedia_page
----------- ----------------- ---------- ---------- --------------------------------------------------------------
= Schwarzenegger Arnold 1947-07-30 https://en.wikipedia.org/wiki/Arnold_Schwarzenegger
�[32m+ �[0m�[32mNorman �[0m�[32mDon �[0m�[32m1935-12-23 �[0m�[32mhttps://en.wikipedia.org/wiki/Don_Norman�[0m
�[32m+ �[0m�[32mKahneman �[0m�[32mDaniel �[0m�[32m1934-03-05 �[0m�[32mhttps://en.wikipedia.org/wiki/Daniel_Kahneman�[0m
�[31m- �[0m�[31mNorman �[0m�[31mDon �[0m�[31m1935-12-25 �[0m�[31mhttps://en.wikipedia.org/wiki/Don_Norman�[0m
�[31m- �[0m�[31mHarbison Carnagey �[0m�[31mDale �[0m�[31m1888-11-24 �[0m�[31mhttps://en.wikipedia.org/wiki/Dale_Carnegie�[0m
�[31m- �[0m�[31mKahneman �[0m�[31mDaniel �[0m�[31m1934-03-05 �[0m�[31mhttps://en.wikipedia.org/wiki/Daniel_Kahneman�[0m
�[31m- �[0m�[31mPittman �[0m�[31mBob �[0m�[31m1953-12-28 �[0m�[31mhttps://en.wikipedia.org/wiki/Robert_Pittman_(media_executive)�[0m
This report will shows you how to get from:
It tells us that we need to:
from the original resource to obtain the new resource
It's a total of 6 changes as seen previously in the summary report.
But why does the diff did not compare the Norman record to each other ?
This is because their is no known identifier for the record, therefore the diff was executed by record hash and not by last name.
The next step will show you how to define the unique columns to drive a diff comparison.
To improve our comparison, we will define the unique column of our data resource with the --driver-columns option.
tabul data diff \
--no-fail \
--report-type unified \
--driver-columns "Last Name" \
diff/characters_original.csv@howto diff/characters_new.csv@howto
Unified Diff Report between the resources (diff/characters_original.csv@howto) and (diff/characters_new.csv@howto).
- �[31mred�[0m for deleted
- �[32mgreen�[0m for added
- �[34mblue�[0m for type change (example: number as varchar)
The data resources are not equals.
diff_status last_name first_name birth_date wikipedia_page
----------- ----------------- ---------- ---------- --------------------------------------------------------------
= Schwarzenegger Arnold 1947-07-30 https://en.wikipedia.org/wiki/Arnold_Schwarzenegger
�[32m+1 �[0mNorman Don �[32m1935-12-23 �[0mhttps://en.wikipedia.org/wiki/Don_Norman
�[31m-1 �[0mNorman Don �[31m1935-12-25 �[0mhttps://en.wikipedia.org/wiki/Don_Norman
�[31m- �[0m�[31mHarbison Carnagey �[0m�[31mDale �[0m�[31m1888-11-24 �[0m�[31mhttps://en.wikipedia.org/wiki/Dale_Carnegie�[0m
= Kahneman Daniel 1934-03-05 https://en.wikipedia.org/wiki/Daniel_Kahneman
�[31m- �[0m�[31mPittman �[0m�[31mBob �[0m�[31m1953-12-28 �[0m�[31mhttps://en.wikipedia.org/wiki/Robert_Pittman_(media_executive)�[0m
This time, the unified report tell us that we need to:
from the original resource to obtain the new resource
This page ends the learning guide of Tabulify at the command line with Tabul where we have learned to perform data operation one command at a time.