This is a public log for my deduplication experiments.

December 20, 2022

INFO:__main__:load_dataset                    : 3414.68 seconds
INFO:__main__:minhash                         : 22966.13 seconds
INFO:__main__:clustering                      : 7676.72 seconds
INFO:__main__:filtering                       : 1118.62 seconds
INFO:__main__:save                            : 3105.66 seconds
INFO:__main__:Data Number (before)            : 40113161
INFO:__main__:Data Number (after)             : 21108567 (52.62%)
INFO:__main__:Duplicate Number                : 19004594 (47.38%)
INFO:__main__:Total Time                      : 38281.88 seconds
INFO:__main__:Deduplicated Dataset            : results/output/deduplicated
INFO:__main__:🤗 Happy Deduplicating 🤗

December 19, 2022

INFO:__main__:load_dataset                    : 22.89 seconds 
INFO:__main__:minhash                         : 3132.38 seconds 
INFO:__main__:clustering                      : 2448.83 seconds
INFO:__main__:filtering                       : 1916.08 seconds
INFO:__main__:save                            : 1508.60 seconds
INFO:__main__:Data Number (before)            : 15148604
INFO:__main__:Data Number (after)             : 13032937 (86.03%)
INFO:__main__:Duplicate Number                : 2115667 (13.97%)
INFO:__main__:Total Time                      : 9028.86 seconds
INFO:__main__:Deduplicated Dataset            : results/output/deduplicated
INFO:__main__:load_dataset                    : 2.33 seconds
INFO:__main__:minhash                         : 5.70 seconds
INFO:__main__:clustering                      : 2848.69 seconds
INFO:__main__:filtering                       : 364.72 seconds
INFO:__main__:save                            : 1553.03 seconds
INFO:__main__:Data Number (before)            : 15148604
INFO:__main__:Data Number (after)             : 13032937 (86.03%)
INFO:__main__:Duplicate Number                : 2115667 (13.97%)
INFO:__main__:Total Time                      : 4774.54 seconds
INFO:__main__:Deduplicated Dataset            : results/output/deduplicated
INFO:__main__:🤗 Happy Deduplicating

December 9, 2022

[12/09/22 20:24:44] INFO     load_dataset                    : 28.08 seconds                                                                                        minhash_deduplication_alt.py:167
                    INFO     minhash                         : 3689.60 seconds                                                                                      minhash_deduplication_alt.py:167
                    INFO     clustering                      : 6322.42 seconds                                                                                      minhash_deduplication_alt.py:167
                    INFO     filtering                       : 2235.21 seconds                                                                                      minhash_deduplication_alt.py:167
                    INFO     save                            : 1478.33 seconds                                                                                      minhash_deduplication_alt.py:167
                    INFO     Data Number (before)            : 15148604                                                                                             minhash_deduplication_alt.py:168
                    INFO     Data Number (after)             : 13032937 (86.03%)                                                                                    minhash_deduplication_alt.py:169
                    INFO     Duplicate Number                : 2115667 (13.97%)                                                                                     minhash_deduplication_alt.py:170
                    INFO     Total Time                      : 13753.72 seconds                                                                                     minhash_deduplication_alt.py:171
                    INFO     Deduplicated Dataset            : results/output/deduplicated                                                                          minhash_deduplication_alt.py:172
                    INFO     🤗 Happy Deduplicating 🤗                                                                                                              minhash_deduplication_alt.py:173

Some lessons:

  1. Clustering and indexing become bigger bottlenecks as the dataset grows. For a faster runtime (at the cost of losing intermediate results and caching), it may be better to run the original script. How can I improve this?
  2. Iterating over a Dataset object is a speed bottleneck in both the indexing and clustering steps. What is the best way to speed it up?
  3. Being aggressive has a cost, and that cost is time! What is a better way to remove those duplicates without building a graph? 🤔
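
One direction worth trying for lesson 3 is a union-find (disjoint-set) structure: it merges duplicate pairs into clusters as they stream in, so no explicit graph object is ever materialized. The sketch below is not what the script currently does. It assumes the query step yields `(id_a, id_b)` integer pairs, and the names `UnionFind` and `duplicate_ids` are made up for illustration.

```python
from collections import defaultdict


class UnionFind:
    """Disjoint-set forest with path compression and union by size."""

    def __init__(self):
        self.parent = {}
        self.size = {}

    def find(self, x):
        # Register unseen ids lazily as singleton sets.
        if x not in self.parent:
            self.parent[x] = x
            self.size[x] = 1
        # Walk up to the root of the set.
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every node on the walk at the root.
        while self.parent[x] != root:
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return
        # Attach the smaller tree under the larger one.
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]


def duplicate_ids(pairs):
    """Return the ids to drop, keeping the smallest id of each cluster."""
    uf = UnionFind()
    for a, b in pairs:
        uf.union(a, b)
    clusters = defaultdict(list)
    for x in list(uf.parent):
        clusters[uf.find(x)].append(x)
    drop = set()
    for members in clusters.values():
        keep = min(members)
        drop.update(m for m in members if m != keep)
    return drop


# Hypothetical neighbor pairs; real ones would come from the query step.
pairs = [(0, 1), (1, 2), (5, 6), (8, 9), (9, 5)]
to_drop = duplicate_ids(pairs)
print(sorted(to_drop))
```

Memory grows with the number of distinct ids rather than the number of edges, which is the appeal here. The trade-off is that it only gives connected components, so any community-detection behavior the graph approach provides would be lost.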

Results are stored in gs://chenghao-data/{python,java,javascript} just in case.

Java

December 2, 2022

python minhash_deduplication_alt.py --dataset bigcode/the-stack --data-dir data/java --revision v1.1.a1 --cache-dir cache2 --ngram-size 5 --threshold 0.65 --min-token-length 0 --fast

Going to do this instead

python minhash_deduplication_alt.py --dataset bigcode/the-stack-dedup-pjj --data-dir data/java --revision v1.1.a1 --cache-dir cache2 --ngram-size 5 --threshold 0.7 --min-token-length 10 --fast
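
To make `--ngram-size 5` and `--threshold 0.7` concrete, here is a minimal sketch of the quantity that MinHash approximates: the Jaccard similarity between the sets of token 5-grams of two files. It assumes whitespace tokenization and exact set arithmetic, which may differ from what the script does internally. The helper names `ngrams` and `jaccard` are illustrative only.

```python
def ngrams(text, n=5):
    """Set of token n-grams; short inputs yield a single shorter tuple."""
    tokens = text.split()  # assumption: plain whitespace tokenization
    return {tuple(tokens[i:i + n]) for i in range(max(1, len(tokens) - n + 1))}


def jaccard(a, b):
    """Exact Jaccard similarity |a & b| / |a | b| of two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


doc_a = "public static void main ( String [ ] args ) {"
doc_b = "public static void main ( String [ ] argv ) {"

similarity = jaccard(ngrams(doc_a), ngrams(doc_b))
threshold = 0.7  # same value as --threshold above
print(f"similarity={similarity:.2f} duplicate={similarity >= threshold}")
```

A single changed token touches every 5-gram that overlaps it, so short snippets fall below the threshold quickly; longer files are far less sensitive to one edit.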

Results

[12/03/22 13:37:40] INFO     Load Dataset                    : 77.18 seconds                                                                                       minhash_deduplication_alt.py:756
                    INFO     Embed                           : 5052.87 seconds                                                                                     minhash_deduplication_alt.py:756
                    INFO     Create Index                    : 16253.12 seconds                                                                                    minhash_deduplication_alt.py:756
                    INFO     Save Index                      : 0.00 seconds                                                                                        minhash_deduplication_alt.py:756
                    INFO     Freeze Memory                   : 0.00 seconds                                                                                        minhash_deduplication_alt.py:756
                    INFO     Query                           : 1321.61 seconds                                                                                     minhash_deduplication_alt.py:756
                    INFO     Save Neighbors                  : 0.00 seconds                                                                                        minhash_deduplication_alt.py:756
                    INFO     Unfreeze Memory                 : 0.00 seconds                                                                                        minhash_deduplication_alt.py:756
                    INFO     Clustering                      : 10825.30 seconds                                                                                    minhash_deduplication_alt.py:756
                    INFO     Total Processing Time           : 34919.87 seconds                                                                                    minhash_deduplication_alt.py:756
                    INFO     Deduplicate                     : 605.83 seconds                                                                                      minhash_deduplication_alt.py:756
                    INFO     Save Deduplicated               : 2356.10 seconds                                                                                     minhash_deduplication_alt.py:756
                    INFO     Language                        : java                                                                                                minhash_deduplication_alt.py:758
                    INFO     Data Number (before filtering)  : 25124914                                                                                            minhash_deduplication_alt.py:759
                    INFO     Data Number (after filtering)   : 24972491                                                                                            minhash_deduplication_alt.py:760
                    INFO     Duplicate Number                : 4822205 (19.31%)                                                                                    minhash_deduplication_alt.py:761
                    INFO     Total Reduction                 : 4974628 (19.80%)                                                                                    minhash_deduplication_alt.py:762
                    INFO     Total Time                      : 37881.83 seconds                                                                                    minhash_deduplication_alt.py:765
                    INFO     ****************************************************************                                                                      minhash_deduplication_alt.py:766
                    INFO     Output Base                     : results/bigcode/the-stack-dedup-pjj/default/70/data/java/train/content                              minhash_deduplication_alt.py:767
                    INFO     Concatenated Dataset            : results/bigcode/the-stack-dedup-pjj/default/70/data/java/train/content/concat                       minhash_deduplication_alt.py:768
                    INFO     Indexed Dataset                 : results/bigcode/the-stack-dedup-pjj/default/70/data/java/train/content/indexed                      minhash_deduplication_alt.py:769
                    INFO     Index                           : results/bigcode/the-stack-dedup-pjj/default/70/data/java/train/content/index.pkl                    minhash_deduplication_alt.py:770
                    INFO     Neighbor Dataset                : results/bigcode/the-stack-dedup-pjj/default/70/data/java/train/content/neighbors                    minhash_deduplication_alt.py:771
                    INFO     Duplicate IDs                   : results/bigcode/the-stack-dedup-pjj/default/70/data/java/train/content/duplicate_ids.json           minhash_deduplication_alt.py:772
                    INFO     Unique Paths                    : results/bigcode/the-stack-dedup-pjj/default/70/data/java/train/content/unique_paths.json            minhash_deduplication_alt.py:773
                    INFO     Graph                           : results/bigcode/the-stack-dedup-pjj/default/70/data/java/train/content/graph.networkit              minhash_deduplication_alt.py:774
                    INFO     Community                       : results/bigcode/the-stack-dedup-pjj/default/70/data/java/train/content/community.partition          minhash_deduplication_alt.py:775
                    INFO     Deduplicated Dataset            : results/bigcode/the-stack-dedup-pjj/default/70/data/java/train/content/deduplicated                 minhash_deduplication_alt.py:776
                    INFO     🤗 Happy Deduplicating 🤗                                                                                                             minhash_deduplication_alt.py:777

I am having issues with the Java data with this command. The number of (duplicate, neighbors) pairs is 27 million. One subset (500k) has 900 million edges, which means we are looking at 72 billion edges (80 subsets) across the whole dataset. That's way too large to handle on a single machine.
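
A quick back-of-the-envelope check of that estimate. The edge and subset counts are the ones above; the 16 bytes per edge (two 64-bit ids, no weights or overhead) is my assumption and is a lower bound for any real graph library.

```python
edges_per_subset = 900_000_000  # observed in one 500k subset
num_subsets = 80
bytes_per_edge = 2 * 8  # assumption: two 64-bit node ids, nothing else

total_edges = edges_per_subset * num_subsets
total_bytes = total_edges * bytes_per_edge

print(f"{total_edges:,} edges")
print(f"{total_bytes / 1e12:.2f} TB for a bare edge list")
```

Even under that optimistic storage assumption the edge list alone is over a terabyte.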

JavaScript

python minhash_deduplication_alt.py --dataset bigcode/the-stack-dedup-pjj --data-dir data/javascript --revision v1.1.a1 --cache-dir cache2 --ngram-size 5 --threshold 0.7 --min-token-length 10 --fast

Results

[12/04/22 01:12:09] INFO     Load Dataset                    : 779.31 seconds                                                                                                                                                                                          minhash_deduplication_alt.py:756
                    INFO     Embed                           : 11697.71 seconds                                                                                                                                                                                        minhash_deduplication_alt.py:756
                    INFO     Create Index                    : 16848.07 seconds                                                                                                                                                                                        minhash_deduplication_alt.py:756
                    INFO     Save Index                      : 0.00 seconds                                                                                                                                                                                            minhash_deduplication_alt.py:756
                    INFO     Freeze Memory                   : 0.00 seconds                                                                                                                                                                                            minhash_deduplication_alt.py:756
                    INFO     Query                           : 1099.03 seconds                                                                                                                                                                                         minhash_deduplication_alt.py:756
                    INFO     Save Neighbors                  : 0.00 seconds                                                                                                                                                                                            minhash_deduplication_alt.py:756
                    INFO     Unfreeze Memory                 : 0.00 seconds                                                                                                                                                                                            minhash_deduplication_alt.py:756
                    INFO     Clustering                      : 3331.91 seconds                                                                                                                                                                                         minhash_deduplication_alt.py:756
                    INFO     Total Processing Time           : 34497.02 seconds                                                                                                                                                                                        minhash_deduplication_alt.py:756
                    INFO     Deduplicate                     : 488.48 seconds                                                                                                                                                                                          minhash_deduplication_alt.py:756
                    INFO     Save Deduplicated               : 5661.81 seconds                                                                                                                                                                                         minhash_deduplication_alt.py:756
                    INFO     Language                        : javascript                                                                                                                                                                                              minhash_deduplication_alt.py:758
                    INFO     Data Number (before filtering)  : 25429179                                                                                                                                                                                                minhash_deduplication_alt.py:759
                    INFO     Data Number (after filtering)   : 24477438                                                                                                                                                                                                minhash_deduplication_alt.py:760
                    INFO     Duplicate Number                : 4183219 (17.09%)                                                                                                                                                                                        minhash_deduplication_alt.py:761
                    INFO     Total Reduction                 : 5134960 (20.19%)                                                                                                                                                                                        minhash_deduplication_alt.py:762
                    INFO     Total Time                      : 40647.41 seconds                                                                                                                                                                                        minhash_deduplication_alt.py:765
                    INFO     ****************************************************************                                                                                                                                                                          minhash_deduplication_alt.py:766
                    INFO     Output Base                     : results/bigcode/the-stack-dedup-pjj/default/70/data/javascript/train/content                                                                                                                            minhash_deduplication_alt.py:767
                    INFO     Concatenated Dataset            : results/bigcode/the-stack-dedup-pjj/default/70/data/javascript/train/content/concat                                                                                                                     minhash_deduplication_alt.py:768
                    INFO     Indexed Dataset                 : results/bigcode/the-stack-dedup-pjj/default/70/data/javascript/train/content/indexed                                                                                                                    minhash_deduplication_alt.py:769
                    INFO     Index                           : results/bigcode/the-stack-dedup-pjj/default/70/data/javascript/train/content/index.pkl                                                                                                                  minhash_deduplication_alt.py:770
                    INFO     Neighbor Dataset                : results/bigcode/the-stack-dedup-pjj/default/70/data/javascript/train/content/neighbors                                                                                                                  minhash_deduplication_alt.py:771
                    INFO     Duplicate IDs                   : results/bigcode/the-stack-dedup-pjj/default/70/data/javascript/train/content/duplicate_ids.json                                                                                                         minhash_deduplication_alt.py:772
                    INFO     Unique Paths                    : results/bigcode/the-stack-dedup-pjj/default/70/data/javascript/train/content/unique_paths.json                                                                                                          minhash_deduplication_alt.py:773
                    INFO     Graph                           : results/bigcode/the-stack-dedup-pjj/default/70/data/javascript/train/content/graph.networkit                                                                                                            minhash_deduplication_alt.py:774
                    INFO     Community                       : results/bigcode/the-stack-dedup-pjj/default/70/data/javascript/train/content/community.partition                                                                                                        minhash_deduplication_alt.py:775
                    INFO     Deduplicated Dataset            : results/bigcode/the-stack-dedup-pjj/default/70/data/javascript/train/content/deduplicated                                                                                                               minhash_deduplication_alt.py:776
                    INFO     🤗 Happy Deduplicating 🤗