This is a public log for my deduplication experiments.
December 20, 2022
INFO:__main__:load_dataset : 3414.68 seconds
INFO:__main__:minhash : 22966.13 seconds
INFO:__main__:clustering : 7676.72 seconds
INFO:__main__:filtering : 1118.62 seconds
INFO:__main__:save : 3105.66 seconds
INFO:__main__:Data Number (before) : 40113161
INFO:__main__:Data Number (after) : 21108567 (52.62%)
INFO:__main__:Duplicate Number : 19004594 (47.38%)
INFO:__main__:Total Time : 38281.88 seconds
INFO:__main__:Deduplicated Dataset : results/output/deduplicated
INFO:__main__:🤗 Happy Deduplicating 🤗
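To see where the time goes in a run like the one above, the per-stage lines can be parsed and turned into shares of the summed stage time. This is only a small sketch: the regex assumes the `INFO:__main__:<stage> : <seconds> seconds` format shown in this entry, and `stage_shares` and `DEC20_LOG` are names made up for illustration, not part of the deduplication script.

```python
import re

# Matches lines such as "INFO:__main__:minhash : 22966.13 seconds".
STAGE_RE = re.compile(
    r"INFO:__main__:(?P<stage>[\w ]+?)\s*:\s*(?P<secs>[\d.]+) seconds"
)


def stage_shares(log_text):
    """Return {stage: (seconds, share of summed stage time)}, skipping 'Total Time'."""
    timings = {}
    for match in STAGE_RE.finditer(log_text):
        stage = match.group("stage")
        if stage != "Total Time":
            timings[stage] = float(match.group("secs"))
    total = sum(timings.values())
    return {stage: (secs, secs / total) for stage, secs in timings.items()}


# Copied from the December 20 entry above.
DEC20_LOG = """\
INFO:__main__:load_dataset : 3414.68 seconds
INFO:__main__:minhash : 22966.13 seconds
INFO:__main__:clustering : 7676.72 seconds
INFO:__main__:filtering : 1118.62 seconds
INFO:__main__:save : 3105.66 seconds
INFO:__main__:Total Time : 38281.88 seconds
"""

if __name__ == "__main__":
    shares = stage_shares(DEC20_LOG)
    for stage, (secs, share) in sorted(shares.items(), key=lambda kv: -kv[1][0]):
        print(f"{stage:<14}{secs:>10.2f} s  {share:6.1%}")
```

For this run, minhash accounts for roughly 60% of the summed stage time and clustering for roughly 20%.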
December 19, 2022
INFO:__main__:load_dataset : 22.89 seconds
INFO:__main__:minhash : 3132.38 seconds
INFO:__main__:clustering : 2448.83 seconds
INFO:__main__:filtering : 1916.08 seconds
INFO:__main__:save : 1508.60 seconds
INFO:__main__:Data Number (before) : 15148604
INFO:__main__:Data Number (after) : 13032937 (86.03%)
INFO:__main__:Duplicate Number : 2115667 (13.97%)
INFO:__main__:Total Time : 9028.86 seconds
INFO:__main__:Deduplicated Dataset : results/output/deduplicated
INFO:__main__:load_dataset : 2.33 seconds
INFO:__main__:minhash : 5.70 seconds
INFO:__main__:clustering : 2848.69 seconds
INFO:__main__:filtering : 364.72 seconds
INFO:__main__:save : 1553.03 seconds
INFO:__main__:Data Number (before) : 15148604
INFO:__main__:Data Number (after) : 13032937 (86.03%)
INFO:__main__:Duplicate Number : 2115667 (13.97%)
INFO:__main__:Total Time : 4774.54 seconds
INFO:__main__:Deduplicated Dataset : results/output/deduplicated
INFO:__main__:🤗 Happy Deduplicating
December 9, 2022
[12/09/22 20:24:44] INFO load_dataset : 28.08 seconds minhash_deduplication_alt.py:167
INFO minhash : 3689.60 seconds minhash_deduplication_alt.py:167
INFO clustering : 6322.42 seconds minhash_deduplication_alt.py:167
INFO filtering : 2235.21 seconds minhash_deduplication_alt.py:167
INFO save : 1478.33 seconds minhash_deduplication_alt.py:167
INFO Data Number (before) : 15148604 minhash_deduplication_alt.py:168
INFO Data Number (after) : 13032937 (86.03%) minhash_deduplication_alt.py:169
INFO Duplicate Number : 2115667 (13.97%) minhash_deduplication_alt.py:170
INFO Total Time : 13753.72 seconds minhash_deduplication_alt.py:171
INFO Deduplicated Dataset : results/output/deduplicated minhash_deduplication_alt.py:172
INFO 🤗 Happy Deduplicating 🤗 minhash_deduplication_alt.py:173
Some lessons:
The Dataset object is a speed bottleneck in both the indexing and the clustering steps. What is the best way to speed it up?
Results are stored in gs://chenghao-data/{python,java,javascript} just in case.
December 2, 2022
python minhash_deduplication_alt.py --dataset bigcode/the-stack --data-dir data/java --revision v1.1.a1 --cache-dir cache2 --ngram-size 5 --threshold 0.65 --min-token-length 0 --fast
Going to do this instead
python minhash_deduplication_alt.py --dataset bigcode/the-stack-dedup-pjj --data-dir data/java --revision v1.1.a1 --cache-dir cache2 --ngram-size 5 --threshold 0.7 --min-token-length 10 --fast
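For intuition about what `--ngram-size 5` and `--threshold 0.7` control, here is a toy MinHash sketch. It is not the code in `minhash_deduplication_alt.py`: the whitespace tokenization, the salted `blake2b` hashes standing in for permutations, and the function names `shingles`, `signature`, and `estimated_jaccard` are all assumptions made for illustration.

```python
import hashlib


def shingles(text, ngram_size=5):
    """Set of word n-grams (whitespace tokenization is an assumption)."""
    tokens = text.split()
    if len(tokens) <= ngram_size:
        return {" ".join(tokens)}
    return {
        " ".join(tokens[i:i + ngram_size])
        for i in range(len(tokens) - ngram_size + 1)
    }


def jaccard(a, b):
    """Exact Jaccard similarity of two non-empty sets."""
    return len(a & b) / len(a | b)


def signature(shingle_set, num_perm=128):
    """MinHash signature: the minimum 64-bit hash under num_perm salted hashes."""
    sig = []
    for seed in range(num_perm):
        salt = seed.to_bytes(16, "little")  # blake2b accepts up to 16 salt bytes
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for s in shingle_set
        ))
    return sig


def estimated_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree; estimates the Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


if __name__ == "__main__":
    doc_a = "a b c d e f g h i j"
    doc_b = "a b c d e f g h i k"  # differs from doc_a in the last token only
    sa, sb = shingles(doc_a), shingles(doc_b)
    print("exact Jaccard    :", jaccard(sa, sb))  # 5 shared 5-grams out of 7
    print("MinHash estimate :", estimated_jaccard(signature(sa), signature(sb)))
```

The two toy documents share 5 of 7 distinct 5-grams, so their exact Jaccard similarity is 5/7 (about 0.714): above a 0.7 threshold, and above 0.65 as well. The MinHash estimate is noisy, so pairs this close to the threshold can land on either side of it.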
Results
[12/03/22 13:37:40] INFO Load Dataset : 77.18 seconds minhash_deduplication_alt.py:756
INFO Embed : 5052.87 seconds minhash_deduplication_alt.py:756
INFO Create Index : 16253.12 seconds minhash_deduplication_alt.py:756
INFO Save Index : 0.00 seconds minhash_deduplication_alt.py:756
INFO Freeze Memory : 0.00 seconds minhash_deduplication_alt.py:756
INFO Query : 1321.61 seconds minhash_deduplication_alt.py:756
INFO Save Neighbors : 0.00 seconds minhash_deduplication_alt.py:756
INFO Unfreeze Memory : 0.00 seconds minhash_deduplication_alt.py:756
INFO Clustering : 10825.30 seconds minhash_deduplication_alt.py:756
INFO Total Processing Time : 34919.87 seconds minhash_deduplication_alt.py:756
INFO Deduplicate : 605.83 seconds minhash_deduplication_alt.py:756
INFO Save Deduplicated : 2356.10 seconds minhash_deduplication_alt.py:756
INFO Language : java minhash_deduplication_alt.py:758
INFO Data Number (before filtering) : 25124914 minhash_deduplication_alt.py:759
INFO Data Number (after filtering) : 24972491 minhash_deduplication_alt.py:760
INFO Duplicate Number : 4822205 (19.31%) minhash_deduplication_alt.py:761
INFO Total Reduction : 4974628 (19.80%) minhash_deduplication_alt.py:762
INFO Total Time : 37881.83 seconds minhash_deduplication_alt.py:765
INFO **************************************************************** minhash_deduplication_alt.py:766
INFO Output Base : results/bigcode/the-stack-dedup-pjj/default/70/data/java/train/content minhash_deduplication_alt.py:767
INFO Concatenated Dataset : results/bigcode/the-stack-dedup-pjj/default/70/data/java/train/content/concat minhash_deduplication_alt.py:768
INFO Indexed Dataset : results/bigcode/the-stack-dedup-pjj/default/70/data/java/train/content/indexed minhash_deduplication_alt.py:769
INFO Index : results/bigcode/the-stack-dedup-pjj/default/70/data/java/train/content/index.pkl minhash_deduplication_alt.py:770
INFO Neighbor Dataset : results/bigcode/the-stack-dedup-pjj/default/70/data/java/train/content/neighbors minhash_deduplication_alt.py:771
INFO Duplicate IDs : results/bigcode/the-stack-dedup-pjj/default/70/data/java/train/content/duplicate_ids.json minhash_deduplication_alt.py:772
INFO Unique Paths : results/bigcode/the-stack-dedup-pjj/default/70/data/java/train/content/unique_paths.json minhash_deduplication_alt.py:773
INFO Graph : results/bigcode/the-stack-dedup-pjj/default/70/data/java/train/content/graph.networkit minhash_deduplication_alt.py:774
INFO Community : results/bigcode/the-stack-dedup-pjj/default/70/data/java/train/content/community.partition minhash_deduplication_alt.py:775
INFO Deduplicated Dataset : results/bigcode/the-stack-dedup-pjj/default/70/data/java/train/content/deduplicated minhash_deduplication_alt.py:776
INFO 🤗 Happy Deduplicating 🤗 minhash_deduplication_alt.py:777
I am having issues with the Java data with this command. The number of (duplicate, neighbors) pairs is 27 million. In one subset (500k), there are 900 million edges, which extrapolates to 72 billion edges across the whole dataset (80 subsets). That's far too large to handle on a single machine.
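The back-of-the-envelope estimate above, written out. The edge counts come from the paragraph; the 8 bytes per node id (so 16 bytes per edge, with no graph-library overhead) is my assumption, and the extrapolation assumes every subset contributes about as many edges as the one measured. The helper names are made up for illustration.

```python
def extrapolate_edges(edges_in_one_subset, num_subsets):
    """Scale the edge count of one subset to the whole dataset."""
    return edges_in_one_subset * num_subsets


def edge_list_bytes(num_edges, bytes_per_node_id=8):
    """Size of a raw (source, target) edge list; 64-bit ids are an assumption."""
    return num_edges * 2 * bytes_per_node_id


total_edges = extrapolate_edges(900_000_000, 80)
raw_bytes = edge_list_bytes(total_edges)

if __name__ == "__main__":
    print(f"{total_edges:,} edges, ~{raw_bytes / 1e12:.2f} TB as a raw edge list")
```

Under those assumptions, the bare edge list alone would take about 1.15 TB before any graph structure is built on top of it.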
python minhash_deduplication_alt.py --dataset bigcode/the-stack-dedup-pjj --data-dir data/javascript --revision v1.1.a1 --cache-dir cache2 --ngram-size 5 --threshold 0.7 --min-token-length 10 --fast
Results
[12/04/22 01:12:09] INFO Load Dataset : 779.31 seconds minhash_deduplication_alt.py:756
INFO Embed : 11697.71 seconds minhash_deduplication_alt.py:756
INFO Create Index : 16848.07 seconds minhash_deduplication_alt.py:756
INFO Save Index : 0.00 seconds minhash_deduplication_alt.py:756
INFO Freeze Memory : 0.00 seconds minhash_deduplication_alt.py:756
INFO Query : 1099.03 seconds minhash_deduplication_alt.py:756
INFO Save Neighbors : 0.00 seconds minhash_deduplication_alt.py:756
INFO Unfreeze Memory : 0.00 seconds minhash_deduplication_alt.py:756
INFO Clustering : 3331.91 seconds minhash_deduplication_alt.py:756
INFO Total Processing Time : 34497.02 seconds minhash_deduplication_alt.py:756
INFO Deduplicate : 488.48 seconds minhash_deduplication_alt.py:756
INFO Save Deduplicated : 5661.81 seconds minhash_deduplication_alt.py:756
INFO Language : javascript minhash_deduplication_alt.py:758
INFO Data Number (before filtering) : 25429179 minhash_deduplication_alt.py:759
INFO Data Number (after filtering) : 24477438 minhash_deduplication_alt.py:760
INFO Duplicate Number : 4183219 (17.09%) minhash_deduplication_alt.py:761
INFO Total Reduction : 5134960 (20.19%) minhash_deduplication_alt.py:762
INFO Total Time : 40647.41 seconds minhash_deduplication_alt.py:765
INFO **************************************************************** minhash_deduplication_alt.py:766
INFO Output Base : results/bigcode/the-stack-dedup-pjj/default/70/data/javascript/train/content minhash_deduplication_alt.py:767
INFO Concatenated Dataset : results/bigcode/the-stack-dedup-pjj/default/70/data/javascript/train/content/concat minhash_deduplication_alt.py:768
INFO Indexed Dataset : results/bigcode/the-stack-dedup-pjj/default/70/data/javascript/train/content/indexed minhash_deduplication_alt.py:769
INFO Index : results/bigcode/the-stack-dedup-pjj/default/70/data/javascript/train/content/index.pkl minhash_deduplication_alt.py:770
INFO Neighbor Dataset : results/bigcode/the-stack-dedup-pjj/default/70/data/javascript/train/content/neighbors minhash_deduplication_alt.py:771
INFO Duplicate IDs : results/bigcode/the-stack-dedup-pjj/default/70/data/javascript/train/content/duplicate_ids.json minhash_deduplication_alt.py:772
INFO Unique Paths : results/bigcode/the-stack-dedup-pjj/default/70/data/javascript/train/content/unique_paths.json minhash_deduplication_alt.py:773
INFO Graph : results/bigcode/the-stack-dedup-pjj/default/70/data/javascript/train/content/graph.networkit minhash_deduplication_alt.py:774
INFO Community : results/bigcode/the-stack-dedup-pjj/default/70/data/javascript/train/content/community.partition minhash_deduplication_alt.py:775
INFO Deduplicated Dataset : results/bigcode/the-stack-dedup-pjj/default/70/data/javascript/train/content/deduplicated minhash_deduplication_alt.py:776
INFO 🤗 Happy Deduplicating 🤗