Chapter 2 Survey of existing data sources

The purpose of this chapter is to provide a structured map of useful sources of information about an industry or occupation. This should provide a wide range of potential areas that could be explored. This chapter is focused on identifying possible data sources rather than actually gathering them. The intention is to analyse each potential data source in enough detail to determine what kind of information is available and how comprehensive it is. For example, if discussing the financial status of an organisation you would document how detailed the information was concerning where income and expenditure were coming from and roughly what percentage of organisations had this information available. Actually obtaining the full data from a source may be practically impossible given the time within this module. The goal of this chapter is to provide an insight into the limits of what might be possible and to ensure that analysis focuses on what might lead to the most valuable insights rather than the easiest data to obtain.

Most data sources will be partial and will be biased in some way. For example, not all staff will be on a site like LinkedIn, and not all of those who are on that site will record their job at a company, i.e. the data is partial. It is also not a random sample, as certain types of people will tend to be more likely to record this information, and the circumstances of the job and why it was left may influence the likelihood of this data being recorded. In week 4 we will explore some options to address these issues. At this stage the key thing is to find the best data sources you can and not to discard any data source out of hand because of these issues.

The key to finding data sources is to do a lot of Google searches and to carefully navigate through websites. In addition there is a set of websites and government data sources that are always worth investigating.
In particular, a generally useful tool for getting historical information is the Wayback Machine (https://web.archive.org/), which records snapshots of websites that were publicly available at different points in time.

In this document any text within quotes “Like this” represents a Google search term.

Be very careful not to record personally identifiable information of UK or EU citizens or those working in the UK or EU. Once you have recorded a piece of personally identifiable information about an EU or UK citizen, this data becomes subject to GDPR legislation, which is very restrictive and for most practical purposes requires the permission of the individual to use. However, if you gather aggregate statistics manually from publicly available personal data, i.e. count up the number of profiles with a certain property, then if you are careful you will not be storing and processing personally identifiable data. This is a link to a post discussing existing research that makes use of LinkedIn, including some links to anonymised datasets that may be of use: https://blogs.lse.ac.uk/impactofsocialsciences/2019/07/09/using-linkedin-for-social-research/.

The main difficulty with gathering data in aggregate form is that you must gather all of the information at once, and you can’t merge and link datasets because you can’t record any identifying information from the individuals you were surveying. Unfortunately, you can’t keep any identifiable information, and that includes a one-way cryptographic hash. The statistics you gather are specific to the process you use and the period of time in which you gather them; however, they can still be highly informative.

Before doing a lot of LinkedIn reading you may want to change your visibility settings: https://www.linkedin.com/help/linkedin/answer/49410/browsing-profiles-in-private-and-semi-private-mode?lang=en

In the examples below most data is focused on NI or UK sources.
This is just an example; there is no need to focus your analysis on these areas, and because of issues such as data privacy legislation there may be more value in exploring the US or other regions. Different countries vary in the amount of publicly available information and the number of organisations in a given industry.

The key information to show for each section is:
• The information that you are seeking to find for each section and why it would be valuable
• The process used to find the data, e.g. Google search terms, links from documents
• The feasibility of obtaining the data and to what extent it can be linked with other information
• How comprehensive and/or biased the data obtained from the process might be

It is easy to get overwhelmed when searching for information, as it isn’t structured in a way that provides perspective on an industry. It is good to always remember your goal: do work that makes it easier for others to identify the causal factors that affect outcomes that decision makers within an industry value. Specifically, to help them make good choices.

I. The space of parts of work

A. Organisations

Our goal with this section is to identify a list of organisations, ideally with useful properties, that can be used as the basis for further investigation. Each organisation is potentially a useful starting point for further analysis. Next week you will be deeply investigating one example of an organisation or project. In this chapter we are more focused on breadth, in particular acquiring a high level overview of an industry and how the organisations vary within it.

1. Software Development

We are seeking to identify the companies that provide software development services, potentially focused on a particular region such as NI or the UK.
“software development companies northern ireland”, “software development companies UK”

Businesses registered with Companies House have an SIC code, which you could use to find companies in a given industry by downloading the full database and processing the records to identify those with the SIC code you are looking for. It is important to note that there is no formal verification of these codes, so they may be in error (and in my experience frequently are).

You can also find companies actively working in an industry using job sites: https://wecodeni.com/, https://www.nijobs.com/Software-Company-Jobs-in-Northern-Ireland. This will highlight larger organisations and may be biased towards those that are growing or have a high turnover of staff.

For larger organisations in NI you can use employment data from the Equality Commission. You can then search for the names of the businesses online to identify what business they are in (or it may be obvious from the company name). By using the numbers of employees, analysis can focus on larger or smaller organisations or take a sample of different groups based on size.

https://www.equalityni.org/Search-Results?q=data#gsc.tab=0&gsc.q=data&gsc.page=1
https://www.equalityni.org/ECNI/media/ECNI/Publications/Delivering%20Equality/FETO%20Monitoring%20Reports/No30/MonReport30-Private26-employees.pdf?ext=.pdf

Using these approaches it is likely that we will be finding companies that advertise their services broadly and may be focused on large, multi-year business to business projects. There are potentially many more places where software development takes place. For example, large companies tend to have internal software development teams, and there are also independent web and mobile developers that take on contract work for individuals or small businesses. These may require other routes to identify them, such as crowd working sites like Upwork.com or Fiverr.com.
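Returning to the Companies House route described above, the step of processing the downloaded records to pick out one SIC code could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not a definitive implementation: the column names in SIC_COLUMNS are an assumption about how the bulk download labels its SIC fields and should be checked against the header row of the file you actually download, and the three sample companies are invented purely for illustration.

```python
import csv
import io

# ASSUMPTION: these column names are a guess at how the bulk file labels its
# SIC fields; check them against the real header row before relying on this.
SIC_COLUMNS = [
    "SICCode.SicText_1",
    "SICCode.SicText_2",
    "SICCode.SicText_3",
    "SICCode.SicText_4",
]


def companies_with_sic(rows, sic_prefix, sic_columns=SIC_COLUMNS):
    """Yield each row where any SIC column starts with the given code."""
    for row in rows:
        # Missing or empty columns are treated as empty strings.
        codes = [(row.get(col) or "").strip() for col in sic_columns]
        if any(code.startswith(sic_prefix) for code in codes):
            yield row


# Invented sample rows standing in for the downloaded file.
SAMPLE = """CompanyName,CompanyNumber,SICCode.SicText_1,SICCode.SicText_2
EXAMPLE SOFTWARE LTD,NI000001,62012 - Business and domestic software development,
EXAMPLE BAKERY LTD,NI000002,10710 - Manufacture of bread,
EXAMPLE CONSULTING LTD,NI000003,70229 - Management consultancy,62012 - Business and domestic software development
"""

reader = csv.DictReader(io.StringIO(SAMPLE))
matches = list(companies_with_sic(reader, "62012"))
for company in matches:
    print(company["CompanyNumber"], company["CompanyName"])
```

With a real download you would pass an open file to csv.DictReader instead of the in-memory sample. Because the codes are self-reported and unverified, the matches are best treated as a candidate list to check by hand rather than a definitive census of the industry.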
2. Higher Education

Ideally we are looking to find a list of universities within a region or globally. There are many public lists of universities and sites that rank them with metadata.

“universities in the uk” -> https://www.ukuni.net/universities
“universities in the uk” -> https://en.wikipedia.org/wiki/List_of_universities_in_the_United_Kingdom

These lists are relatively comprehensive.

“organisations that rank UK universities” (this provides a list including a number of different review sites)

These sites also seem reasonably comprehensive, at least for UK universities. It would also be worth exploring the space of different university ranking systems and seeing whether they provide useful metadata for each university.

It may also be valuable to gather data on the different departments within each university.

“ranking of university departments” -> https://www.timeshighereducation.com/world-university-rankings/by-subject
“ranking of university departments” -> https://www.topuniversities.com/subject-rankings/2020

These rankings include metadata such as staff/student ratio and number of international students. Using staff numbers it may be possible to estimate student numbers. If we could clarify whether the staff/student ratio included information from postgraduate courses, it may be possible to separate out estimates of numbers of postgraduate students. Undergraduate and postgraduate student numbers are a significant driver of income for the department.

B. Projects

In this section we are searching for examples of projects that organisations have undertaken, ideally with metadata such as the budget, staffing, resources, management approach and timescale of the projects. Even better if we can obtain more detailed task level data about the project.

Projects can be hard to get information on. I know from my interviews with experts that some government software projects are open source. At a higher level, government contracts are supposed to be open (https://www.gov.uk/government/publications/open-contracting).
However, after a relatively quick search I couldn’t find any more detail than a single paragraph of description of a project, who had won the bid, its total cost and timescale. The government website talked as if the entire process was open. I emailed them to ask how to find a detailed breakdown for software projects (tasks and milestones) as well as any information on rival bids that weren’t accepted. I received no reply to my email. More may be possible with a freedom of information request.

1. Software Development

a) Government contract projects

“uk government tenders kainos” -> https://bidstats.uk/tenders/?q=kainos+software

The bidstats website only contains the last year, but it references the organisations the information has been taken from. If we could examine these sources then we might be able to get historical records that could be processed to get useful data.

b) Mentions in shareholder reports

“kainos shareholder reports” -> https://www.kainos.com/investor-relations/results-and-presentations -> https://go.kainos.com/rs/272-PGO-379/images/202105FullYearPresentation.pdf (lists some customers)
-> “kainos "irish life"” -> links to LinkedIn account of business analyst on project
“kainos NHS Digital” (many different stories and case studies on different sites)
-> “kainos NHS digital github” (links to a number of GitHub repositories for Kainos staff which show contributions to GitHub projects) -> https://github.com/nhsconnect/integration-adaptor-gp2gp

Including a freedom of information request relating to one of the projects: https://www.whatdotheyknow.com/request/nhs_app_development_by_kainos

These searches reveal some information about the projects and produce some promising leads for gathering more information, particularly the case studies and GitHub repositories. The main concern here is that only the larger, higher status companies might be mentioned in the shareholders report.
Another useful source is company Twitter (or other social media) accounts, which often search for any public mention of the company. These may then contain mentions of the projects that the company is working on:

“kainos twitter” -> https://twitter.com/KainosSoftware?ref_src=twsrc%5Egoogle%7Ctwcamp%5Eserp%7Ctwgr%5Eauthor

The need for content makes it more likely that social media feeds will capture more of the media about the company (at least the positive media). However, this may only be possible with larger organisations that have staff with explicit responsibility for managing such an account.

2. Higher Education

What educational programs are being run?

“software development degree courses” -> https://www.postgraduatesearch.com/pgs/search?course=software-development
https://www.whatuni.com/degree-courses/search?subject=software-development
“QUB software conversion masters” -> https://www.qub.ac.uk/courses/postgraduate-taught/software-development-msc/#course

For each course the syllabus may be available, enabling an assessment of whether certain topics were being covered. Using the Wayback Machine, for many degree programs we may be able to identify changes in the course content, entry requirements and fees over time.

https://web.archive.org/web/*/https://www.qub.ac.uk/courses/postgraduate-taught/software-development-msc/

It is also valuable to explore rankings for individual academic courses.

“ranking of university courses” -> https://www.theguardian.com/education/ng-interactive/2020/sep/05/the-best-uk-universities-2021-league-table

This provides valuable statistics including the percentage of students getting a career in the subject after graduation and the percentage that drop out of the course in the first year. These statistics often focus on undergraduate programs, which may be less useful as the number of students on each program is likely to be capped, at least for local students part funded by government.
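The Wayback Machine lookups mentioned above for course pages can be partly automated, since the Internet Archive exposes a CDX search endpoint that lists the snapshots it holds for a URL. The sketch below is a hedged illustration rather than a definitive recipe: it only builds the query URL and turns a JSON response into snapshot links, it assumes the response is a list of rows whose first row names the fields, and the sample response is made up so that the example runs without network access. The function names are hypothetical.

```python
import json
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_query(page_url, from_year, to_year):
    """Build a CDX query asking for roughly one successful snapshot per year."""
    params = {
        "url": page_url,
        "output": "json",
        "from": str(from_year),
        "to": str(to_year),
        "filter": "statuscode:200",  # keep only successful captures
        "collapse": "timestamp:4",   # collapse on the 4-digit year
    }
    return CDX_ENDPOINT + "?" + urlencode(params)


def snapshot_urls(cdx_json_text):
    """Turn a CDX JSON response into a list of browsable snapshot URLs."""
    rows = json.loads(cdx_json_text)
    if not rows:
        return []
    # ASSUMPTION: the first row is a header naming each field.
    header, records = rows[0], rows[1:]
    ts = header.index("timestamp")
    original = header.index("original")
    return [
        f"https://web.archive.org/web/{rec[ts]}/{rec[original]}"
        for rec in records
    ]


# Made-up response in the assumed shape, for illustration only.
SAMPLE_RESPONSE = json.dumps([
    ["urlkey", "timestamp", "original", "mimetype", "statuscode", "digest", "length"],
    ["uk,ac,qub)/courses", "20190101000000", "https://www.qub.ac.uk/courses/", "text/html", "200", "AAAA", "1000"],
    ["uk,ac,qub)/courses", "20200101000000", "https://www.qub.ac.uk/courses/", "text/html", "200", "BBBB", "1100"],
])

query = build_cdx_query("https://www.qub.ac.uk/courses/", 2015, 2021)
urls = snapshot_urls(SAMPLE_RESPONSE)
print(query)
print(urls)
```

In practice you would fetch the query URL and pass the response text to snapshot_urls, then read each yearly snapshot by hand to note changes in syllabus, entry requirements and fees.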
C. Tasks

Tasks are about analysing individual ‘Tickets’, the relatively atomic repeated activities that are performed as part of projects. This is the highest detail level that is typically available for an industry and is often not gathered in a public way. However, some governments, charities or otherwise particularly open and transparent organisations can share how their work is performed. Task based datasets allow for a validation of the task/process modelling in the previous chapter and may enable us to create relatively precise aggregate statistics about projects. For example, for a software project the time spent working on different parts of the project (e.g. UI, database etc.) and the number of bugs fixed (and possibly created) by different team members could be quantified and analysed.

For individual roles there may be case studies available that could provide a general approach to understanding tasks and how they vary in an industry. However, case studies are often chosen because they were successful and may be presented in an unrealistically polished way. Another option is to search for projects that are run in an open and collaborative way, as they may have all of their tasks documented publicly.

1. Software Development

In software, as most of the work is undertaken on computers and using software, detailed task level records are often gathered, for example by source control systems.

For the requirements gathering and project management process:

“public jira projects” -> https://www.reddit.com/r/atlassian/comments/lgcycy/public_jira_projects/ -> https://ecosystem.atlassian.net/jira/projects

When software is used to track projects and the breakdown of the project into tickets, these software systems may be made open for certain projects. If the software, e.g.
Jira, includes APIs for analysis, it may be possible to analyse these projects to identify how requirements are created and change over time, as well as when designs change after implementation, to understand when the requirements process is working well or not.

For the programming process:

It may be possible to find government projects on GitHub that correspond to government contracted projects, so that the company that delivered the project, the budget and the timescale can be identified.

“uk government projects on github” -> https://www.api.gov.uk/#uk-government-apis

Each of the APIs in this list corresponds to a project which may be open sourced on GitHub. Some of these may have been produced through contracts with software development companies. By examining each in turn and cross-referencing the project against records of government tenders we may be able to identify budgets and timescales.

“kainos uk government github” -> https://github.com/companieshouse/payments.web.ch.gov.uk

From our searches using a company name it may be possible to identify projects that are being run by the company for a branch of UK government. We may then be able to find the public information about the tender from the UK sources and identify the budget and timescale of the project.

For UX design:

“ux design case study examples” -> https://beszolios.medium.com/8-well-done-ux-case-studies-every-designers-should-read-f15cfd8687e1

Unfortunately these examples are mostly showcases that may be heavily simplified or biased compared with the raw activity and time use information we would really value. However, they may show the key stages and tasks, which are helpful for understanding the role.

2. Higher Education

Open courses with forums or communities that have been run over a period of time. We can use the forum information to see how students are finding the taught material of the course. In some cases we may be able to identify the feedback, discussions and reasoning that have led to changes in the module due to public student comments.
It will take some creativity to formalise these text based discussions into a format that can be compared.

“free online course materials” -> https://www.openculture.com/freeonlinecourses

Ideally we are looking for courses that have changed over time, so that we can understand the interaction between the creator of the course and the work being done. By examining how courses are taught differently at different institutions we can examine how that material might impact other measurements such as student satisfaction and employment statistics of different courses. Certain specialised online courses such as https://www.fast.ai/ are developed very publicly with an active community that includes the lecturer, and there it may be possible to identify the changes, what has motivated them and ideally the positive and negative consequences of those changes.

D. Careers

The goal of this section is to find data about how careers progress over time, ideally with any metadata about the nature of the roles and any factors that might have a significant causal impact on obtaining the role or on why the person left a previous role. For example, technical skills, salaries etc.

1. Software Development

It is very valuable to have access to CVs of people in an industry to understand plausible career paths. The most popular site for this information is LinkedIn.

“linkedin kainos” -> https://www.linkedin.com/company/kainos/

Companies often have LinkedIn pages that will include links to people who have included the company in their CVs on the site.

“kainos project” -> https://www.kainos.com/insights/blogs/path-project-management

A previous search led to a link to a case study for a project management position.

2. Higher Education

LinkedIn can also be used for academic careers. Universities also promote their staff on their websites, and they often have structured staff biographies. Using the Wayback Machine it is also possible to identify staff roles changing over time, information that may not be as comprehensively recorded via LinkedIn.
https://web.archive.org/web/*/https://www.qub.ac.uk/schools/eeecs/Connect/Staff/

E. Jobs

The goal of this section is to find data about the work carried out within a particular role. This is similar to the breakdown of tasks on a project, but in this case focused on a particular role. From this information we are seeking to identify the variation in how a role is carried out: what tasks and skills are needed, how time is spent, what responsibilities the role has and what it pays. Ideally we want to link these jobs to the organisations that produce them and to the careers of the individuals pursuing those roles, so we can see what might cause the roles to differ and the impact that the jobs have on the careers of the individuals who participate in them.

1. Software Development

Day in the life articles. These provide another source for the tasks that different roles take part in and may include timing information that clarifies how much of a person’s day is spent engaged in different activities.

“a day in the life of a software developer” -> https://codeinstitute.net/blog/a-day-in-the-life-of-a-software-developer/

O*NET is a comprehensive formalisation of jobs produced by the US Government. It is based on a combination of the analysis of job postings and interviews with job holders. It includes many properties associated with different occupations, including tasks associated with a job and skills that are needed, as well as the typical personality traits of people within the profession.

https://www.onetonline.org/find/

Recruitment/job websites provide many jobs by region as well as structured searches, for example by technology, by company, by salary etc. This provides a range of examples of jobs and metadata about them.

“recruitment websites” -> https://uk.indeed.com/jobs?q=software+developer&l=Belfast%2C+County+Antrim

2. Higher Education

University positions tend to be posted to a small number of specialised sites. For the UK, Times Higher Education is one of the more common sites.
https://www.timeshighereducation.com/unijobs/listings/united-kingdom/

II. Lenses

The lenses within this section are places where useful information is commonly found. You can think of them as representing different groups that are interested in analysing organisations and occupations.

A. Financial

The goal of this section is to try to identify useful data sources that provide financial information. The bottom line for many organisations is that they need to remain profitable to remain in business.

1. Companies House

a) Software development

“Companies house lookup business” -> https://www.gov.uk/get-information-about-a-company -> https://find-and-update.company-information.service.gov.uk/search?q=Kainos -> https://find-and-update.company-information.service.gov.uk/company/NI019370 -> https://find-and-update.company-information.service.gov.uk/company/NI019370/filing-history

This includes a history of financial accounts for the business as well as changes in directorships and shares. For larger public companies these documents can also include lists of corporate risks that the organisation is concerned with and what actions they are taking to mitigate them. These risks are similar to the Issues that we may have identified, except that our issues are focused on problems that have occurred in the past and so may point to a risk that is not being fully addressed. These documents are similar (possibly identical) to the shareholders reports that public companies compile. In terms of financial information the reports include figures for revenue, profit and the value of any assets owned by the business.

b) Higher education

For higher education a search for Queen’s University on Companies House revealed two organisations that are a part of the university but not the university itself.

“queen's university belfast company registration number”

This links to the contact details of Queen’s, which don’t indicate a company number but do include a charity number.
“Northern ireland charity commission NIC101788” -> https://www.charitycommissionni.org.uk/charity-search/?pageNumber=1 -> https://www.charitycommissionni.org.uk/charity-details/?regId=101788&subId=0

This site links to the past 3 years of annual reports, which include some breakdown of income and expenditure similar to that in Companies House financial records.

2. Shareholders reports (and equivalent)

These reports are put together by organisations to fulfil reporting obligations and to present an organisation in a positive way for investors. They can often contain a diverse range of useful information, often mentioning issues the organisation has faced or successes it has achieved and communicating the areas where the organisation is currently focused to bring growth. It is useful to compare the issues and values from these reports to the issues and values identified in the interview with the expert. This can identify issues and values that the senior members of the organisation may not have identified already and that may be the basis for novel recommendations.

a) Software development

“Kainos Shareholders Reports” -> https://www.kainos.com/investor-relations/results-and-presentations

b) Higher education

“queen's university annual report” -> https://www.qub.ac.uk/directorates/FinanceDirectorate/visitors/financial-statements/

These documents go back to 2012 and provide similar detail to company shareholder reports, including the overall income due to teaching and research.

3. Startups and Investment

Organisations that are focused on investing money into companies, or that provide data to those who make investment decisions, are another source of useful data. For example, the website Crunchbase tracks investment in companies and a range of other useful information such as key staff, technology usage and media mentions.

https://www.crunchbase.com/organization/queens-university-belfast

Governments often provide financial support and grants to startups and SME businesses.
“open data investni” -> https://www.opendatani.gov.uk/dataset/open-data-up-to-17-18-csv-file-uploaded-csv-13-to-2016-17

This lists the grants that Invest NI has awarded to different organisations. This is particularly useful for smaller organisations that don’t have a sufficient web presence or revenue to be tracked by Crunchbase or equivalent.

4. Government analysis

Governments often analyse industries to track the performance of the economy. These links are explored in a later section on Labour statistics.

For public bodies like universities and schools there are government bodies that gather and share financial information.

https://www.hesa.ac.uk/data-and-analysis/finances -> https://www.hesa.ac.uk/data-and-analysis/finances/table-1

This lists high level information on the income from tuition fees and research income from 2015.

5. Open projects

Open projects are projects run by organisations that publicly share information, usually including what work is being done, what it costs and what time period it is being performed over.

a) Software development

Projects with government

A previous search identified bidstats, a website that curates recent contracts between companies and the UK government. These contracts have publicly recorded budgets and time periods.

https://bidstats.uk/ -> https://bidstats.uk/tenders/?q=kainos+software

This site only covers contracts over the past year, but it also documents the source of the data. These sources could be separately explored to find records of older contracts.

b) Higher education

For higher education a significant amount of research income comes from government funded research. For example, in Computer Science EPSRC is the main research funding body.
“epsrc archive grant data” -> https://gow.epsrc.ukri.org/ -> https://gow.epsrc.ukri.org/NGBOSearchGrants.aspx

This provides historical records of research grants and enables searching by organisation, such as Queen’s University, and includes details such as the principal investigator and the value of the grant.

B. Government statistics

This section is focused on data sources obtained by governments.

1. Census data

“census data” -> https://www.ons.gov.uk/census/2011census/2011censusdata -> https://www.ons.gov.uk/help/localstatistics -> https://www.nomisweb.co.uk/

Nomis web includes data from a range of sources including the census. The census provides fine grained geographical information about where people with different occupations or working in different industries live in the UK. This could be used to help analyse social mobility, for example to what extent those from poorer areas were getting jobs in high skill/high paid professions and potentially how educational courses might have contributed to this effect. However, with the census being run so infrequently there are relatively few sample points to base the analysis on.

2. Labour statistics

The economy and workforce are major priorities for any government, and their civil service will often perform annual surveys to understand which areas are growing and which may have issues. In Northern Ireland that agency is NISRA.

“Nisra” -> https://www.nisra.gov.uk/statistics -> https://www.nisra.gov.uk/statistics/economy -> https://www.nisra.gov.uk/statistics/annual-employee-jobs-surveys/business-register-and-employment-survey

A previous search revealed an archive of governmental documents held at Queen’s. It may be a more convenient method of searching for government data sources than navigating existing government sites.

https://niopa.qub.ac.uk/

Searching for labour statistics in general revealed a helpful API that makes it easier to get and link this information.
http://www.lmiforall.org.uk/explore_lmi/ and http://api.lmiforall.org.uk/

3. Open Data

There has been a movement to gather together datasets and make them available through open data websites. The groups that advocate for open data often work with the public to make freedom of information requests. They can also make it much easier to access data that may be hard to find through the government websites where the data is originally stored. Opendatani.gov.uk is the main website that focuses on open data sources in NI.

a) Software development

“open data ni” -> https://www.opendatani.gov.uk/group/economy?page=3 -> https://www.opendatani.gov.uk/dataset/northern-ireland-index-of-services/resource/8fdf9b7f-4a85-4c12-bc74-a4180e2dc4f3 -> “”

In this example the links were invalid, so once the data source was identified it was searched for separately, which led to the dataset. The dataset provides quarterly statistics on the output of businesses within broad industry classifications. It has a higher temporal detail than the annual financial statements of organisations but lacks the detail of a specific industry or business. It may still be useful for measuring events occurring within a year or due to seasonal effects.

https://www.opendatani.gov.uk/dataset?q=job&sort=score+desc%2C+metadata_modified+desc

This lists a range of job datasets including:

https://www.nisra.gov.uk/statistics/annual-employee-jobs-surveys/business-register-and-employment-survey

This contains annual estimates for the number of jobs in each SIC industry within NI, including a separation by full-time/part-time and by male/female. There are also details of the number of public and private jobs by region.

There are also companies that gather open data and provide it to help companies make decisions about how they operate.
Two such websites are generally useful for understanding companies: https://stackshare.io/, which indicates the technology stack used in a company, and https://www.importyeti.com/, which can be used to identify the suppliers used by an organisation based on public information recorded from shipping crates.

b) Higher education

Open data can also be searched for by directly googling for “dataset”. This will often lead to useful government data as well as data sources produced by individuals, often through web scraping.

“dataset of university courses” -> https://www.hesa.ac.uk/support/tools-and-downloads/unistats
“dataset of university courses” -> https://www.kaggle.com/tags/universities-and-colleges

Kaggle is a good source for globally scoped datasets, many of which have been web scraped or otherwise gathered and linked together by data analysis enthusiasts.

C. Career

When examining an industry from a career perspective we are focused on the groups that are helping job seekers or employers. Two of the most broadly useful are https://www.linkedin.com, which often has pages for companies linking to staff and in turn their CVs, and company review sites such as https://www.glassdoor.co.uk/. The latter include a range of information, including comments about the company culture, management style and interview procedures, which is difficult to obtain elsewhere. Some companies put effort into keeping the reviews positive, so it is worth bearing this potential bias in mind when using the data. However, factual information about the organisations is likely to be correct.

Whenever you find a website with potentially useful information but no dataset, it is always worth googling for “nameOfCompany/website dataset” or “nameOfCompany/website kaggle” or similar searches. For example, googling this with Glassdoor returned https://www.kaggle.com/rkb0023/glassdoor-data-science-jobs

D. Research

This section is focused on data sources produced by academics.
One difficulty with these data sources is that there can often be a very large amount of discussion over a very small amount of data. Some subjects tend to have very little in terms of shared common datasets that academics work on. For some subjects, compared with industrial research, academic funding bodies tend to focus more strongly on original processes and interpretations than on data collection, which can be laborious but is not necessarily novel. In particular, a lot of management and software engineering research is based on case studies whose data is not shared. Despite these limitations, the following tools can be useful in finding academic datasets.

1. Google dataset search

This lists datasets from many sources including open data and academic datasets.

a) Software development

https://datasetsearch.research.google.com/ -> https://datasetsearch.research.google.com/search?query=github&docid=L2cvMTFyMWsyaGI5cw%3D%3D -> https://www.kaggle.com/davidshinn/github-issues -> https://www.kaggle.com/rtatman/we-got-issues-topic-modelling-of-github-issues

Searching for “github” leads to a Kaggle dataset of issues recorded for a large range of GitHub repositories. This also has a shared set of Python code that clusters the data into different topics. The analysis is not particularly informative; however, the dataset may be useful for getting a sense of the most common types of issues and bugs that occur in software.

b) Higher education

Typing “postgraduate” into Google dataset search produced an autocomplete of “HE student enrolments postgraduate aged 25 to 29 UK 2018/19, by subject area”. This is relatively coarse aggregate information.
Searching for “HE student enrolments postgraduate” provides a larger list of datasets, including https://data.gov.uk/dataset/43036fff-8c2f-49a8-b050-a401f7a9858c/projected-demand-for-places-at-higher-education-institutions-in-london which focuses on projected places within London. Examining the methodology used might enable a similar analysis to be performed in other regions, e.g. NI.

2. Papers with code

This website is focused on datasets and code for machine learning. Most leading machine learning models are trained on the largest datasets concerning a topic. Most machine learning is focused on task level information, as the datasets are usually quite large. For example, in architecture it can be difficult to get data about floor plans, as culturally architecture doesn’t have the same open source values as software. However, there are large datasets available through papers with code (https://paperswithcode.com/datasets?q=floor+plan&v=lst&o=match) which would be very difficult to obtain elsewhere.

a) Software development

https://paperswithcode.com/datasets?q=Software+Engineering&v=lst&o=match -> https://paperswithcode.com/dataset/continuous-defect-prediction -> https://arxiv.org/pdf/1703.04142v2.pdf

Software engineering is a category on the site and leads to a large dataset that has integrated data about commits to source code with information from continuous build pipelines. This information could be used to explore common reasons why builds fail, as well as statistics on how long build pipelines take, revealing some insight into the causes of delays in software development practices.

b) Higher education

It is harder to find good search terms for higher education or university, as many datasets include these words in their description. However, a search for “courses” led to https://paperswithcode.com/datasets?q=courses&v=lst&o=match -> https://paperswithcode.com/dataset/webkb

The data comes from a relatively old scrape of 4 university websites in 1997.
By itself it may be of limited value, although it shows that at least in 1997 there were web pages produced by students that were publicly available. It is possible that they have now moved to Facebook (the original purpose of Facebook); however, it highlights a potentially useful data source for analysing the student experience.

3. UK data service

In the UK, datasets gathered as part of social sciences research grants are shared via the UK data service.

https://ukdataservice.ac.uk/

Much of this data requires explicitly describing the project with a fixed timeline of the use of the data. You may also require an academic to apply for the data in order for it to be approved.

a) Software development

https://beta.ukdataservice.ac.uk/datacatalogue/studies/#!?Search=software%20development&Page=1&Rows=10&Sort=1&DateFrom=440&DateTo=2021

This search returns a range of interesting analyses that may be quite relevant to making recommendations that will lead to improved outcomes for industries. For example, https://beta.ukdataservice.ac.uk/datacatalogue/studies/study?id=851816 includes an analysis of government policies and company management approaches to skill development. This is potentially relevant to the investigation of skill development as a causal factor in determining staff turnover.

b) Higher education

https://beta.ukdataservice.ac.uk/datacatalogue/studies/#!?Search=higher%20education&Rows=10&Sort=1&DateFrom=440&DateTo=2021&Page=1 -> https://beta.ukdataservice.ac.uk/datacatalogue/studies/study?id=851752

This led to a study from South Africa on the impact of higher education on poverty reduction. This goal was identified as a possible value for the minister for education role. The analysis is focused on South Africa before and after apartheid, so the information may not generalise as well to other regions. The research may reference more relevant studies and include methodologies that could be adopted for your region of study.
4. Google Scholar

Academic research tends to be very narrowly focused, so it can be difficult to find good data driven analysis of industries or occupations. In a previous project we found this excellent report on the clothing industry (https://www.ifm.eng.cam.ac.uk/insights/sustainability/well-dressed/). It provides a valuable data driven perspective of how money is distributed amongst the various stages of the design, manufacture and sale of a product. It is likely that similar reports exist for other industries, but we have not yet found the key words or the community where research like this is shared.

Google Scholar is one of the most comprehensive search engines for academic papers.

https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=Analysis+Software+development+industry&btnG= -> https://ieeexplore.ieee.org/abstract/document/844576?casa_token=7MYEspS04Z0AAAAA:BWu2O9prR5RX4HVPRVhr5QDiOSRG7JaZXuffBsicvEw39u6jQnopOyiOy3BBonfiy8aXcq9VsM

It can also be useful to explicitly focus on very high status research, as it often uses much larger datasets.

“journal rankings industry analysis” -> https://www.scimagojr.com/journalrank.php?category=1403

Using the advanced search options on Google Scholar you can search within a certain journal:

https://semo.libguides.com/google-scholar/advanced-searching

https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=source%3AAcademy+source%3Aof+source%3AManagement+source%3AJournal+software+development&btnG=

This lists a range of high quality management research within software development. For example, https://journals.aom.org/doi/abs/10.5465/AMJ.2008.31767300 is a publication on a longitudinal study of 884 software firms and the impact of their technology strategy on their financial success. Unfortunately this research is behind a paywall.

E. Professional associations, market analysis

This section is focused on data sources that are produced by organisations that exist to support companies and workers within an industry.
1. Software Development

A search for “software development industry statistics” returned a link (https://www.cbi.eu/market-information/outsourcing-itobpo/software-development-services/market-potential) that referenced the Stack Overflow survey.

“stack overflow survey” -> https://insights.stackoverflow.com/survey -> https://insights.stackoverflow.com/survey/2020

From the website: In February 2020 nearly 65,000 developers told us how they learn and level up, which tools they’re using, and what they want.

This is one of the larger surveys of software developers and includes many questions relevant to making recommendations to individuals and organisations.

2. Higher Education

The results for “universities in the uk” included a link to Universities UK, an advocacy group for universities in the UK. Their reports are listed at https://www.universitiesuk.ac.uk/policy-and-analysis/reports/Pages/reports.aspx. This includes highlights and analysis from government reports such as https://www.universitiesuk.ac.uk/policy-and-analysis/reports/Pages/scale-UK-HE-TNE-2018-19-Scottish-providers.aspx

This provides a high level overview of the number of students studying on TNE (transnational education) courses, which are typically online distance learning modules taken by students in other countries. These courses can be relatively profitable for universities and are not subject to student caps, as is the case for local students on undergraduate programs.

III. Values

The goal of this section is to search more directly for data sources relating to the values, issues and causes you feel are promising from the previous week. Based on other data you have identified, you may have new areas that you feel might have a more significant causal impact or relate to more important issues or values. If so, update this section to focus on those. In particular, you may wish to explore data sources related to your unquestioned assumptions, as these may be more likely to lead to novel recommendations.
A. Software Development

1. From an employee’s perspective

a) Issue – Choosing to leave

The previous searches identified the ASHE (Annual Survey of Hours and Earnings). This includes a range of metadata and a number of specific analyses, for example a comparison of the salaries of job changers vs job stayers.

https://www.ons.gov.uk/employmentandlabourmarket/peopleinwork/earningsandworkinghours/adhocs/009953annualsurveyofhoursandearningsasheanalysisofjobchangersandstayers

A range of metrics are examined, split by industry or region (but not both).

b) Value – High status position

Following a previous search:

https://www.hesa.ac.uk/support/tools-and-downloads/unistats -> https://www.hesa.ac.uk/data-and-analysis/graduates/salaries

This lists median incomes for graduate survey responses from each institution.

https://www.economy-ni.gov.uk/topics/statistics-and-economic-research/higher-and-further-education-and-training-statistics -> https://www.economy-ni.gov.uk/articles/longitudinal-education-outcomes-northern-ireland-data-linkage-initiative

This longitudinal study may contain valuable insights into the careers of graduates. Unfortunately the data is restricted. However, it may be possible to access this information as part of an approved research project or, in some cases, by requesting statistical summaries directly from those that manage the information. For example, we have had success in the past contacting NISRA directly to ask for data about a specific issue.
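Once a salary table such as the ASHE job changers and stayers analysis has been downloaded and tidied, the comparison it supports could be sketched roughly as below. This is only an illustrative sketch, not the ONS method: the column names (`industry`, `group`, `median_weekly_pay`) are assumptions about how a cleaned export might look, and the figures in the sample are placeholders invented for the example, not real ASHE values.

```python
import csv
import io

# Placeholder rows for illustration only: these are NOT real ASHE figures,
# and the column names are assumptions about a cleaned export.
SAMPLE_CSV = """industry,group,median_weekly_pay
Industry A,changer,520
Industry A,stayer,500
Industry B,changer,450
Industry B,stayer,480
Industry C,changer,610
"""


def changer_vs_stayer_gap(csv_text):
    """Return {industry: % gap of job changers' median pay relative to stayers'}."""
    pay = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        groups = pay.setdefault(row["industry"], {})
        groups[row["group"]] = float(row["median_weekly_pay"])

    gaps = {}
    for industry, groups in pay.items():
        # Skip industries where one of the two groups is missing (partial data).
        if "changer" in groups and "stayer" in groups:
            difference = groups["changer"] - groups["stayer"]
            gaps[industry] = 100 * difference / groups["stayer"]
    return gaps


if __name__ == "__main__":
    for industry, gap in changer_vs_stayer_gap(SAMPLE_CSV).items():
        print(f"{industry}: {gap:+.1f}%")
```

Because the published analysis is split by industry or by region but not both, a sketch like this can only be run along one of those dimensions at a time.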
2. From a founder’s perspective

a) Value – Profitable projects

b) Value – Projects that lead to repeat business

B. Higher Education

1. From a minister for education’s perspective

a) Value – Improved economy

(1) Metrics

(a) Number of jobs in the industry

“number of jobs by industry in NI” -> https://www.nisra.gov.uk/statistics/labour-market-and-social-welfare/quarterly-employment-survey

This lists the number of jobs by broad industry classification by year in Northern Ireland.

For more specialist sectors with corresponding specialist postgraduate programs, for example cyber security:

“number of cyber security jobs in NI” -> https://www.itjobswatch.co.uk/jobs/northern%20ireland/cyber%20security.do

This is a potentially very valuable set of statistics. The methodology section (https://www.itjobswatch.co.uk/about.aspx) doesn’t describe the process in enough detail to replicate it, so it is unclear what sources they are using. However, as most data sources have some bias and incompleteness to them, this signal may still be very valuable. If a list of companies within the sector can be obtained, they can be analysed individually using some of the data sources identified above to determine numbers of employees, for example data on Crunchbase or from annual reports.

(b) Salary of staff within an industry

The itjobswatch website had some salary information.

“websites like itjobswatch” -> http://www.moreofit.com/similar-to/www.itjobswatch.co.uk/Top_10_Sites_Like_Itjobswatch/ -> https://www.checkasalary.co.uk/

This website has historical salary statistics by industry and specifically for NI. The site describes these values as coming from the ONS.
“Salary data from ONS” -> https://www.ons.gov.uk/employmentandlabourmarket/peopleinwork/earningsandworkinghours

The ONS gathers a reasonably detailed survey of income and working hours:

https://www.ons.gov.uk/employmentandlabourmarket/peopleinwork/earningsandworkinghours/adhocs/004906annualsurveyofhoursandearnings

The ASHE results tables provide estimates on the levels and distribution of earnings and paid hours worked for employees in the UK. Estimates are available for a variety of breakdowns: by age group, industry, occupation, public / private sector and a range of geographies.

“ons ashe data” -> https://www.ons.gov.uk/searchdata?q=Annual+Survey+of+Hours+and+Earnings+table+16 -> https://www.ons.gov.uk/employmentandlabourmarket/peopleinwork/earningsandworkinghours/datasets/industry4digitsic2007ashetable16

This data is focused on relatively broad industry categories. For more fine grained salaries in a specific industry and role, e.g. cybersecurity penetration testers, it may be more valuable to sample job adverts.

(2) Causal factors

(a) Number of postgraduate students on a degree program

“dataset of number of students by postgraduate course” -> https://www.hesa.ac.uk/support/tools-and-downloads/unistats

The unistats dataset has some high level information but lacks the detailed per postgraduate course student numbers that would be most valuable. As universities are subject to freedom of information requests, it may be possible to request the numbers of enrolled postgraduate students associated with each course at each university.

“nisra education statistics” -> https://www.nisra.gov.uk/statistics/children-education-and-skills/higher-and-further-education-and-training-statistics -> https://www.economy-ni.gov.uk/topics/statistics-and-economic-research/higher-and-further-education-and-training-statistics -> https://www.economy-ni.gov.uk/articles/higher-education-enrolments

These datasets provide numbers for postgraduate students as a whole and numbers of undergraduates by subject.
The finer grained detail of the number of postgraduate students per course and university is not available. It may be possible to examine the broad causal impact of numbers of students on undergraduate programs, as these are more precisely identified.

IV. Conclusions & Further Work

What have you learnt from this analysis? What areas do you think could lead to the most impactful recommendations?

A. Conclusions

Which areas are most promising for finding high impact recommendations for different roles?

B. Further work

To help those that are looking to build on your analysis, identify areas you think would be valuable to focus on expanding. For example, you might note that you only examined one funding agency for higher education and that it would be valuable to systematically gather references to all the main funding bodies so that a complete breakdown of public grants could be performed. This could form the basis for making valuable recommendations about how to improve research income, particularly if this data could be linked to the careers of the staff involved to explore how academic careers develop and lead to research grant income. Similarly, there are many different government data sources regarding analysis of the labour force and it would be useful to document them and what they contain in more detail.

Other sources of data that are not online could also be worked with. For example, freedom of information requests could be made, statistics bodies like NISRA can be contacted for information, and organisations can be contacted directly, e.g. the director of education could be contacted to ask about student numbers on postgraduate courses, or companies could be contacted to ask about the number of branches they have and when they were set up. Also, it can be valuable to post to social media sites such as https://www.reddit.com/r/datasets/ or https://www.quora.com/ or https://www.researchgate.net/topics.
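One way to document data sources in more detail, as suggested above, is to keep a small structured catalogue alongside the report. The sketch below is one possible layout rather than a required format: the field names (`coverage`, `granularity`, `access`, `known_biases`) are assumptions chosen to mirror the questions this chapter asks of each source, and the two example entries only restate what is said about those sources earlier in this chapter.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass
class DataSource:
    name: str
    url: str
    coverage: str      # who or what is included, e.g. region and period
    granularity: str   # level of detail available
    access: str        # "open", "restricted", "paywall", ...
    known_biases: str  # how the source is partial or non-random


# Example entries restating descriptions from this chapter; the wording of
# each field is illustrative and should be replaced with your own notes.
EXAMPLE_SOURCES = [
    DataSource(
        name="Business Register and Employment Survey",
        url="https://www.nisra.gov.uk/statistics/annual-employee-jobs-surveys/"
            "business-register-and-employment-survey",
        coverage="NI",
        granularity="annual, by SIC industry, full-time/part-time, male/female",
        access="open",
        known_biases="not yet assessed",
    ),
    DataSource(
        name="Longitudinal Education Outcomes (NI data linkage initiative)",
        url="https://www.economy-ni.gov.uk/articles/"
            "longitudinal-education-outcomes-northern-ireland-data-linkage-initiative",
        coverage="NI graduates",
        granularity="not yet assessed",
        access="restricted",
        known_biases="not yet assessed",
    ),
]


def to_csv(sources):
    """Serialise DataSource records to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=[f.name for f in fields(DataSource)],
        lineterminator="\n",
    )
    writer.writeheader()
    for source in sources:
        writer.writerow(asdict(source))
    return buffer.getvalue()


def needing_follow_up(sources):
    """Return sources that cannot simply be downloaded (anything not 'open')."""
    return [s for s in sources if s.access != "open"]


if __name__ == "__main__":
    print(to_csv(EXAMPLE_SOURCES))
    print([s.name for s in needing_follow_up(EXAMPLE_SOURCES)])
```

Keeping the catalogue as a plain CSV means it can be shared with the report and extended by whoever builds on the analysis; the `needing_follow_up` helper simply lists the sources where a request or application would be the next step.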
V. Appendix

Discuss any steps you have taken to make your work more useful for a data analyst like yourself looking to build on your work.

The list of lenses is not necessarily comprehensive. Some of the lenses may also be less relevant for some industries. If you find valuable new perspectives, then altering the sections of this document to make it more effective for your industry will lead to improved marks for this chapter. Make sure to comment in the appendix section on why you have made the changes you have.