I have a homework project on web scraping and am supposed to collect all the event information for a month from a school website. I am using Python with Requests and Beautiful Soup. I have written some code to fetch a URL and am trying to grab all of the li elements on the page that hold the event information. However, when I grab the li content, I do not receive all of them. I suspected this was due to the "overflow:hidden" style on the ul, but then why am I able to get the first few li elements?
from bs4 import BeautifulSoup
import requests

url = 'https://apps.iu.edu/ccl-prd/events/view?date=06012016&type=day&pubCalId=GRP1322'
r = requests.get(url)
bsObj = BeautifulSoup(r.text, "html.parser")
eventList = []

# Collect every anchor tag that carries an href attribute
eventURLs = bsObj.find_all("a", href=True)
print(len(eventURLs))

# Print a numbered list of the hrefs (loop variable renamed so it no longer shadows url)
for count, link in enumerate(eventURLs, start=1):
    print(str(count) + '. ' + link['href'])
I am printing out the URLs because I plan to follow the href link of each event to get the full descriptions and other metadata provided. However, I am not getting all of the event li elements; I am only getting the first 5. In the output below, the event links are numbers 19 to 23. The page has 10 events in total, though.
Output:
1. https://www.indiana.edu/
2. #advancedSearch
3. /ccl-prd/events/view?type=week&date=06012016&pubCalId=GRP1322
4. /ccl-prd/events/view?type=month&date=06012016&pubCalId=GRP1322
5. /ccl-prd/events/view?type=day&date=06222016&pubCalId=GRP1322
6. /ccl-prd/events/view?pubCalId=GRP1432&type=day&date=06012016
7. /ccl-prd/events/view?pubCalId=GRP1445&type=day&date=06012016
8. /ccl-prd/events/view?pubCalId=GRP1436&type=day&date=06012016
9. /ccl-prd/events/view?pubCalId=GRP1438&type=day&date=06012016
10. /ccl-prd/events/view?pubCalId=GRP1440&type=day&date=06012016
11. /ccl-prd/events/view?pubCalId=GRP1443&type=day&date=06012016
12. /ccl-prd/events/view?pubCalId=GRP1434&type=day&date=06012016
13. /ccl-prd/events/view?pubCalId=GRP1447&type=day&date=06012016
14. /ccl-prd/events/view?pubCalId=GRP1450&type=day&date=06012016
15. http://newsinfo.iu.edu/
16. http://www.indiana.edu/~iuvis/
17. /ccl-prd/events/view?type=day&date=06012016&iub=BL011&pubCalId=GRP1322
18. /ccl-prd/events/view?type=day&date=06012016&iub=BL153&pubCalId=GRP1322
19. /ccl-prd/events/view/13147231?viewParams=%26type%3dday%26date%3d06012016&theDate=06222016&referrer=listView&pubCalId=GRP1322
20. /ccl-prd/events/view/13163329?viewParams=%26type%3dday%26date%3d06012016&referrer=listView&pubCalId=GRP1322
21. /ccl-prd/events/view/13163465?viewParams=%26type%3dday%26date%3d06012016&theDate=06222016&referrer=listView&pubCalId=GRP1322
22. /ccl-prd/events/view/13110443?viewParams=%26type%3dday%26date%3d06012016&theDate=06222016&referrer=listView&pubCalId=GRP1322
23. /ccl-prd/events/view/11744967?viewParams=%26type%3dday%26date%3d06012016&theDate=06222016&referrer=listView&pubCalId=GRP1322
24. http://www.iu.edu/copyright/index.shtml
25. http://www.iu.edu/
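Since only some of the hrefs in the output above are event pages, the list can be narrowed before following any links. The following is a minimal sketch using only the standard library. It assumes that event detail pages always have the form /events/view/<numeric id>, as entries 19 to 23 do; the names event_links and EVENT_PATH are made up for illustration.

```python
import re
from urllib.parse import urljoin

# Page URL from the question; used to resolve relative hrefs.
BASE_URL = "https://apps.iu.edu/ccl-prd/events/view?date=06012016&type=day&pubCalId=GRP1322"

# Assumption: event detail pages look like /events/view/<numeric id>,
# as entries 19 to 23 in the output do.
EVENT_PATH = re.compile(r"/events/view/\d+")


def event_links(hrefs, base_url=BASE_URL):
    """Return absolute URLs for the hrefs that look like event detail pages."""
    return [urljoin(base_url, h) for h in hrefs if EVENT_PATH.search(h)]


# A few hrefs copied from the output above.
sample = [
    "https://www.indiana.edu/",
    "#advancedSearch",
    "/ccl-prd/events/view?type=week&date=06012016&pubCalId=GRP1322",
    "/ccl-prd/events/view/13163329?viewParams=%26type%3dday%26date%3d06012016&referrer=listView&pubCalId=GRP1322",
]
links = event_links(sample)
print(links)
```

In the scraper itself, the hrefs would come from something like [a['href'] for a in eventURLs]. This only filters what the fetched HTML already contains, so it does not recover the five missing events.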
TLDR: I am not getting all the links from the li elements on a page when I use Python Requests and Beautiful Soup. Why am I not getting the links, and is there a better way of going about this problem?
Edited to give the answer: The links I needed were all being created with JavaScript, and since Requests and Beautiful Soup do not run JavaScript, I have instead moved to Selenium with PhantomJS. However, an answer below shows how to get the JavaScript-generated information by using parameters in Python Requests, which is a perfect way of doing this!
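Whichever way each page ends up being fetched, collecting a whole month still needs one day view per date. Here is a rough sketch of building those URLs from the query parameters already visible in the URL above. It assumes the date parameter is MMDDYYYY, which the values 06012016 and 06222016 suggest but which has not been confirmed; day_urls is a made-up helper name. It does not solve the JavaScript problem by itself.

```python
import calendar
from urllib.parse import urlencode

# Path of the day view, taken from the URL in the question.
BASE = "https://apps.iu.edu/ccl-prd/events/view"


def day_urls(year, month, cal_id="GRP1322"):
    """Build one day-view URL for every day of the given month."""
    days_in_month = calendar.monthrange(year, month)[1]
    urls = []
    for day in range(1, days_in_month + 1):
        # Assumption: the site expects dates formatted as MMDDYYYY.
        params = {
            "date": f"{month:02d}{day:02d}{year}",
            "type": "day",
            "pubCalId": cal_id,
        }
        urls.append(f"{BASE}?{urlencode(params)}")
    return urls


urls = day_urls(2016, 6)
print(len(urls))
print(urls[0])
```

Each URL in the list can then be passed to requests.get or to the Selenium driver, depending on which approach is used.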