Data and Resources

We propose to develop or adapt a web-scraping tool that extracts data from Securities and Exchange Commission (SEC) filings. Publicly traded companies must submit annual 10-K filings, which detail financial performance, to the SEC. 10-K filings contain both forward-looking statements and risk factor sections, each consisting of multiple paragraphs. These 10-K subsections may emphasize technical, market, or supply chain risk, among other factors. We plan to use unsupervised learning (e.g., clustering on cosine similarity) to group and compare companies based on their stated business model risks.

As a starting point, a subset of companies from the iShares Nasdaq Biotechnology ETF (NASDAQ: IBB) will have Item 1A (i.e., the risk factor section) of their latest 10-K parsed. A combination of K-means clustering and cosine similarity will be applied in search of discrete groupings, or vulnerability classes. Once companies are grouped into risk categories, their financial performance (e.g., net profit, changes in stock price) will be compared pre- and post-COVID-19. Among biotechnology companies, we are interested in the relationship between 1) types of business model risk (as stated in SEC filings) and 2) financial growth during a pandemic.

Our initial pipeline will compare only a handful of companies from IBB. However, we plan to generalize the pipeline to later handle the entire ETF, or other arbitrary sets of companies with SEC filings. Time permitting, we would also extend our cosine similarity efforts to social media data. For example, does the persona of a post/tweet from a company (or executive) profile pre-COVID hold any predictive power for the company's performance mid-COVID?
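The comparison step described above could look roughly like the following. This is a minimal sketch under stated assumptions, not the project's actual pipeline: the company names and risk texts are invented placeholders standing in for parsed Item 1A sections, the stopword list is a toy one, and the clustering is a cosine-based (spherical) variant of K-means with a simple farthest-point initialization, written with the Python standard library only so that it runs without extra dependencies.

```python
import math
import re
from collections import Counter, defaultdict

# Toy stopword list (assumption); a real pipeline would use a fuller one.
STOPWORDS = {"and", "the", "our", "are", "may", "from", "could"}

def tokenize(text):
    """Lowercase, keep alphabetic tokens of 3+ letters, drop stopwords."""
    return [t for t in re.findall(r"[a-z]{3,}", text.lower()) if t not in STOPWORDS]

def normalize(vec):
    """Scale a sparse vector (dict term -> weight) to unit L2 length."""
    norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
    return {t: w / norm for t, w in vec.items()}

def tfidf_vectors(docs):
    """Return one unit-length TF-IDF vector per document (smoothed IDF)."""
    counts = [Counter(tokenize(d)) for d in docs]
    n = len(docs)
    df = Counter(term for c in counts for term in c)  # document frequency
    return [
        normalize({t: tf * (math.log((1 + n) / (1 + df[t])) + 1) for t, tf in c.items()})
        for c in counts
    ]

def cosine(a, b):
    """Cosine similarity of two unit-length sparse vectors (their dot product)."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())

def spherical_kmeans(vectors, k, iters=20):
    """Cluster unit vectors by assigning each to its most similar centroid."""
    # Farthest-point initialization: start from the first document, then add
    # the document least similar to the centroids chosen so far.
    centroids = [vectors[0]]
    while len(centroids) < k:
        centroids.append(min(vectors, key=lambda v: max(cosine(v, c) for c in centroids)))
    labels = None
    for _ in range(iters):
        new_labels = [max(range(k), key=lambda j: cosine(v, centroids[j])) for v in vectors]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for j in range(k):
            total = defaultdict(float)
            for v, label in zip(vectors, labels):
                if label == j:
                    for t, w in v.items():
                        total[t] += w
            if total:  # keep the old centroid if a cluster is empty
                centroids[j] = normalize(total)
    return labels

# Hypothetical companies and invented risk text, for illustration only.
risk_text = {
    "AlphaBio": "Our clinical trials may fail and regulatory approval from the FDA may be delayed.",
    "BetaGen": "Clinical trials are costly and FDA regulatory approval is uncertain.",
    "GammaRx": "We depend on a single supplier and supply chain disruptions could halt manufacturing.",
    "DeltaMed": "Manufacturing relies on third party supplier contracts and supply chain disruptions.",
}

names = list(risk_text)
vectors = tfidf_vectors(list(risk_text.values()))
labels = spherical_kmeans(vectors, k=2)

for name, label in zip(names, labels):
    print(f"{name}: cluster {label}")
```

Because the vectors are normalized to unit length, the dot product equals cosine similarity, so the same function serves both the pairwise comparison and the cluster assignment. On real filings, the number of clusters and the text preprocessing would need tuning, and a library implementation of TF-IDF and K-means would likely replace the hand-written functions.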

Additional Info

Source https://github.com/BU-Spark/summer2021internship/tree/master/Police%20Arrest%20Analysis
Last Updated June 12, 2023, 05:49 (UTC)
Created June 12, 2023, 05:44 (UTC)