Results are out!!!

One of my GSoC project's main goals is parallelizing the reading of structs in Parquet files. This is very beneficial to the astronomy community, as astronomical data is often stored in structs and other nested structures.

This figure shows the performance increase/decrease of the optimized Parquet reader (which I wrote) over the baseline main branch of apache/arrow. The benchmark was run on a standard free GitHub runner, with the file loaded into RAM to remove the I/O overhead, which is not relevant here.

For smaller files, the overhead of multithreading makes the optimized reader slower, but for real-life cases, where files are large, it now provides 25%-30%+ faster reading times.
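For anyone who wants to turn their own timings into a percentage like this, here is a small sketch. It assumes "faster" means the reduction in read time relative to the baseline; the helper names and the example timings are hypothetical, not values from the benchmark.

```python
def percent_time_saved(baseline_s, optimized_s):
    """Reduction in read time relative to the baseline, in percent.

    A negative result means the optimized reader was slower.
    """
    return (baseline_s - optimized_s) / baseline_s * 100.0


def speedup_factor(baseline_s, optimized_s):
    """How many times faster the optimized read is than the baseline."""
    return baseline_s / optimized_s


if __name__ == "__main__":
    # Hypothetical timings in seconds, chosen only to illustrate the arithmetic.
    baseline, optimized = 1.00, 0.75
    print(percent_time_saved(baseline, optimized))
    print(speedup_factor(baseline, optimized))
```

The two numbers are not interchangeable: a 25% reduction in read time corresponds to roughly a 1.33x speedup, so it helps to say which one a chart reports.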

Compared to flat Parquet files, nested structs now offer similar read speeds with multithreading enabled!!

Overall, the project is going at a great pace with verifiable results. I hope this gets merged into the upstream Apache Arrow library soon so that this performance boost can help people working with PyArrow on astronomy datasets.

The link to the PR - https://github.com/apache/arrow/pull/50158

Link to the results colab notebook - https://colab.research.google.com/drive/1TsxFkBSI_Iq0hfXEwNDs_D24acr3yxdC?usp=sharing

Orchestrating and benchmarking repo - https://github.com/OmBiradar/pyarrow-lincc-fw-openastronomy-gsoc26