[Defense] Efficient and Accurate Machine Learning Model Computation in Data Science Languages with Data Summarization via a Gram Matrix
Tuesday, April 12, 2022
11:45 am - 1:45 pm
In Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy
Sikder Tahsin Al Amin
will defend his dissertation
Efficient and Accurate Machine Learning Model Computation in Data Science Languages with Data Summarization via a Gram Matrix
Abstract
Nowadays, data science analysts prefer “easy” high-level languages such as R and Python for machine learning computation, but these languages have memory and speed limitations. Scalability also becomes an issue as the data set grows. Data summarization has been a fundamental technique in data mining and shows promise for more demanding data science applications. With these motivations in mind, this work presents an efficient way to compute statistical and machine learning models with data summarization that works both sequentially and in parallel and can be easily integrated with popular data science languages. The summarization produces one or multiple summaries, accelerates a broad class of statistical and machine learning models, and requires a small amount of RAM. The solution can also compute models incrementally, with the algorithms interleaving model computation periodically as the data set is being summarized. Experimental evaluations show that the solution works on both data subsets and the full data set without any performance penalty. The performance of the solution is also compared on a single machine and in parallel. On a single machine, it has an edge over R and is competitive with Python. In parallel, it is faster than other parallel big data systems: Spark (Spark-MLlib library) and a parallel DBMS (a similar approach implemented with UDFs and SQL queries).
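As a rough illustration of the idea in the abstract, the sketch below summarizes a data set chunk by chunk into a single small Gram matrix and then fits a linear regression from that summary alone. It is a minimal example under stated assumptions, not the dissertation's implementation: the augmented row layout [1, x, y], the function names summarize and linear_regression_from_gram, and the synthetic data are all hypothetical choices made here for illustration.

```python
import numpy as np


def summarize(chunks):
    """Accumulate the Gram matrix Z^T Z of augmented rows z = [1, x, y].

    Each chunk is an (X, y) pair. The summary has size (d + 2) x (d + 2),
    independent of the number of rows (assumed layout, for illustration only).
    """
    gram = None
    for X, y in chunks:
        Z = np.column_stack([np.ones(len(X)), X, y])
        partial = Z.T @ Z  # summary of this chunk only
        gram = partial if gram is None else gram + partial
    return gram


def linear_regression_from_gram(gram):
    """Solve the normal equations using only blocks of the summary."""
    A = gram[:-1, :-1]  # [1, X]^T [1, X]
    b = gram[:-1, -1]   # [1, X]^T y
    return np.linalg.solve(A, b)


# Synthetic data: intercept 2.0 and three slopes (hypothetical values).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
true_beta = np.array([2.0, -1.0, 0.5, 3.0])
y = true_beta[0] + X @ true_beta[1:] + 0.01 * rng.normal(size=1000)

# Process the data in 10 chunks of 100 rows, as if it did not fit in RAM.
chunks = [(X[i:i + 100], y[i:i + 100]) for i in range(0, 1000, 100)]
gram = summarize(chunks)
beta = linear_regression_from_gram(gram)
```

Because the partial summaries simply add up, the same accumulation could run on separate data partitions in parallel, and a model can be recomputed from the running summary at any point while the data is still being read.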
Tuesday, April 12, 2022
11:45AM - 1:45PM CT
Hybrid: PGH 392 and virtual via
Dr. Carlos Ordonez, dissertation advisor
Faculty, students and the general public are invited.
